many OSU benchmark tests time out on fluke login node #36

Open
garlick opened this issue Sep 30, 2021 · 2 comments

garlick commented Sep 30, 2021

When running TEST_LONG=t ./t1001-osu-benchmarks.t, the tests that involve two "nodes" take a very long time to run and exceed their timeouts, much like when some of those LONGTEST tests were enabled in CI.

For example, 1n2p pt2pt/osu_latency runs in about a second, but 2n2p pt2pt/osu_latency exceeds its 5m timeout.

expecting success: run_osutest 300 2 2 pt2pt/osu_latency
# OSU MPI Latency Test
# Size            Latency (us)
0                      9882.68
not ok 5 - 2n2p pt2pt/osu_latency

Here's the output from the 1n2p one:

# OSU MPI Latency Test
# Size            Latency (us)
0                         0.23
1                         0.22
2                         0.22
4                         0.22
8                         0.22
16                        0.22
32                        0.25
64                        0.25
128                       0.34
256                       0.35
512                       0.41
1024                      0.47
2048                      0.60
4096                      0.85
8192                      1.41
16384                     2.27
32768                     3.51
65536                     5.92
131072                   10.68
262144                   20.12
524288                   39.37
1048576                  68.38
2097152                 156.03
4194304                 400.01
ok 4 - 1n2p pt2pt/osu_latency
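
To put a number on the gap between the two runs, here is a minimal sketch that parses osu_latency output into (size, latency) rows. The helper name parse_osu_latency is hypothetical and not part of the test suite; the sample text is copied from the output above (the 1n2p table is truncated to its first two rows).

```python
import re

# Matches a data row such as "1024                      0.47":
# an integer message size followed by a floating-point latency.
ROW = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_osu_latency(text):
    """Return a list of (size_bytes, latency_us) rows from osu_latency output.

    Hypothetical helper for illustration only. Header lines ("# ...") and
    TAP result lines ("ok"/"not ok") do not match ROW and are skipped.
    """
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2))))
    return rows


two_node = """\
# OSU MPI Latency Test
# Size            Latency (us)
0                      9882.68
not ok 5 - 2n2p pt2pt/osu_latency
"""

one_node_head = """\
# OSU MPI Latency Test
# Size            Latency (us)
0                         0.23
1                         0.22
"""

slow = parse_osu_latency(two_node)
fast = parse_osu_latency(one_node_head)

# Compare the 0-byte latency, the only row the 2n2p run produced.
ratio = slow[0][1] / fast[0][1]
print(f"rows: 2n2p={len(slow)}, 1n2p(head)={len(fast)}, 0-byte ratio ~{ratio:.0f}x")
```

The 2n2p run produced a single row before hitting its timeout, and that 0-byte latency is several orders of magnitude above the 1n2p value.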

garlick commented Sep 30, 2021

Some debug output, obtained with this change:

index 6c87005..8af0fef 100755
--- a/t/t1001-osu-benchmarks.t
+++ b/t/t1001-osu-benchmarks.t
@@ -65,6 +65,7 @@ test_expect_success 'create rc.lua script' "
        cat >rc.lua <<-EOT
        plugin.load (\"$PLUGINPATH/pmix.so\")
        shell.setenv (\"OMPI_MCA_btl_tcp_if_include\", \"lo\")
+       shell.setenv (\"OMPI_MCA_btl_base_verbose\", \"100\")
        EOT
 "
expecting success: run_osutest 300 2 2 pt2pt/osu_latency
[fluke108:1597939] mca: base: components_register: registering framework btl components
[fluke108:1597939] mca: base: components_register: found loaded component self
[fluke108:1597939] mca: base: components_register: component self register function successful
[fluke108:1597939] mca: base: components_register: found loaded component ofi
[fluke108:1597939] mca: base: components_register: component ofi register function successful
[fluke108:1597939] mca: base: components_register: found loaded component sm
[fluke108:1597939] mca: base: components_register: component sm register function successful
[fluke108:1597939] mca: base: components_register: found loaded component tcp
[fluke108:1597939] mca: base: components_register: component tcp register function successful
[fluke108:1597939] mca: base: components_register: found loaded component uct
[fluke108:1597939] mca: base: components_register: component uct register function successful
[fluke108:1597939] mca: base: components_register: found loaded component usnic
[fluke108:1597939] mca: base: components_register: component usnic register function successful
[fluke108:1597939] mca: base: components_open: opening btl components
[fluke108:1597939] mca: base: components_open: found loaded component self
[fluke108:1597939] mca: base: components_open: component self open function successful
[fluke108:1597939] mca: base: components_open: found loaded component ofi
[fluke108:1597939] mca: base: components_open: component ofi open function successful
[fluke108:1597939] mca: base: components_open: found loaded component sm
[fluke108:1597939] mca: base: components_open: component sm open function successful
[fluke108:1597939] mca: base: components_open: found loaded component tcp
[fluke108:1597939] mca: base: components_open: component tcp open function successful
[fluke108:1597939] mca: base: components_open: found loaded component uct
[fluke108:1597939] mca: base: components_open: component uct open function successful
[fluke108:1597939] mca: base: components_open: found loaded component usnic
[fluke108:1597939] mca: base: components_open: component usnic open function successful
[fluke108:1597939] select: initializing btl component self
[fluke108:1597939] select: init of component self returned success
[fluke108:1597939] select: initializing btl component ofi
[fluke1080][[37950,0],0][btl_ofi_component.c:241:mca_btl_ofi_component_init] initializing ofi btl
[fluke108:1597938] mca: base: components_register: registering framework btl components
[fluke108:1597938] mca: base: components_register: found loaded component self
[fluke108:1597938] mca: base: components_register: component self register function successful
[fluke108:1597938] mca: base: components_register: found loaded component ofi
[fluke108:1597938] mca: base: components_register: component ofi register function successful
[fluke108:1597938] mca: base: components_register: found loaded component sm
[fluke108:1597938] mca: base: components_register: component sm register function successful
[fluke108:1597938] mca: base: components_register: found loaded component tcp
[fluke108:1597938] mca: base: components_register: component tcp register function successful
[fluke108:1597938] mca: base: components_register: found loaded component uct
[fluke108:1597938] mca: base: components_register: component uct register function successful
[fluke108:1597938] mca: base: components_register: found loaded component usnic
[fluke108:1597938] mca: base: components_register: component usnic register function successful
[fluke108:1597938] mca: base: components_open: opening btl components
[fluke108:1597938] mca: base: components_open: found loaded component self
[fluke108:1597938] mca: base: components_open: component self open function successful
[fluke108:1597938] mca: base: components_open: found loaded component ofi
[fluke108:1597938] mca: base: components_open: component ofi open function successful
[fluke108:1597938] mca: base: components_open: found loaded component sm
[fluke108:1597938] mca: base: components_open: component sm open function successful
[fluke108:1597938] mca: base: components_open: found loaded component tcp
[fluke108:1597938] mca: base: components_open: component tcp open function successful
[fluke108:1597938] mca: base: components_open: found loaded component uct
[fluke108:1597938] mca: base: components_open: component uct open function successful
[fluke108:1597938] mca: base: components_open: found loaded component usnic
[fluke108:1597938] mca: base: components_open: component usnic open function successful
[fluke108:1597938] select: initializing btl component self
[fluke108:1597938] select: init of component self returned success
[fluke108:1597938] select: initializing btl component ofi
[fluke1081][[37950,0],1][btl_ofi_component.c:241:mca_btl_ofi_component_init] initializing ofi btl
[fluke1080][[37950,0],0][btl_ofi_component.c:349:mca_btl_ofi_component_init] ofi btl found 34 possible resources.
[fluke1080][[37950,0],0][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1080][[37950,0],0][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1080][[37950,0],0][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1080][[37950,0],0][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1080][[37950,0],0][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1080][[37950,0],0][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1080][[37950,0],0][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1080][[37950,0],0][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1080][[37950,0],0][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1080][[37950,0],0][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1080][[37950,0],0][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1080][[37950,0],0][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1080][[37950,0],0][btl_ofi_component.c:73:validate_info] validating device: mlx4_0-dgram
[fluke1080][[37950,0],0][btl_ofi_component.c:119:validate_info] device: mlx4_0-dgram is good to go.
[fluke1080][[37950,0],0][btl_ofi_component.c:449:mca_btl_ofi_init_device] initializing dev:mlx4_0-dgram provider:verbs;ofi_rxd
[fluke1080][[37950,0],0][btl_ofi_component.c:513:mca_btl_ofi_init_device] btl/ofi using normal endpoint.
[fluke1080][[37950,0],0][btl_ofi_component.c:398:mca_btl_ofi_component_init] ofi btl initialization complete. found 1 suitable transports
[fluke108:1597939] select: init of component ofi returned success
[fluke108:1597939] select: initializing btl component sm
[fluke1080][[37950,0],0][btl_sm_component.c:322:mca_btl_sm_component_init] No peers to communicate with. Disabling sm.
[fluke108:1597939] select: init of component sm returned failure
[fluke108:1597939] mca: base: close: component sm closed
[fluke108:1597939] mca: base: close: unloading component sm
[fluke108:1597939] select: initializing btl component tcp
[fluke108:1597939] btl:tcp: 0x76d600: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[fluke108:1597939] btl:tcp: Attempting to bind to AF_INET port 1024
[fluke108:1597939] btl:tcp: Successfully bound to AF_INET port 1024
[fluke108:1597939] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[fluke108:1597939] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[fluke108:1597939] select: init of component tcp returned success
[fluke108:1597939] select: initializing btl component uct
[fluke1080][[37950,0],0][btl_uct_component.c:493:mca_btl_uct_component_init] initializing uct btl
[fluke1080][[37950,0],0][btl_uct_component.c:498:mca_btl_uct_component_init] no uct memory domains specified
[fluke108:1597939] select: init of component uct returned failure
[fluke108:1597939] mca: base: close: component uct closed
[fluke108:1597939] mca: base: close: unloading component uct
[fluke108:1597939] select: initializing btl component usnic
[fluke108:1597939] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[fluke108:1597939] select: init of component usnic returned failure
[fluke108:1597939] mca: base: close: component usnic closed
[fluke108:1597939] mca: base: close: unloading component usnic
[fluke1081][[37950,0],1][btl_ofi_component.c:349:mca_btl_ofi_component_init] ofi btl found 34 possible resources.
[fluke1081][[37950,0],1][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1081][[37950,0],1][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1081][[37950,0],1][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1081][[37950,0],1][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1081][[37950,0],1][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1081][[37950,0],1][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1081][[37950,0],1][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1081][[37950,0],1][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1081][[37950,0],1][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1081][[37950,0],1][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1081][[37950,0],1][btl_ofi_component.c:73:validate_info] validating device: mlx4_0
[fluke1081][[37950,0],1][btl_ofi_component.c:90:validate_info] ofi_rxm does not support FI_DELIVERY_COMPLETE
[fluke1081][[37950,0],1][btl_ofi_component.c:73:validate_info] validating device: mlx4_0-dgram
[fluke1081][[37950,0],1][btl_ofi_component.c:119:validate_info] device: mlx4_0-dgram is good to go.
[fluke1081][[37950,0],1][btl_ofi_component.c:449:mca_btl_ofi_init_device] initializing dev:mlx4_0-dgram provider:verbs;ofi_rxd
[fluke1081][[37950,0],1][btl_ofi_component.c:513:mca_btl_ofi_init_device] btl/ofi using normal endpoint.
[fluke1081][[37950,0],1][btl_ofi_component.c:398:mca_btl_ofi_component_init] ofi btl initialization complete. found 1 suitable transports
[fluke108:1597938] select: init of component ofi returned success
[fluke108:1597938] select: initializing btl component sm
[fluke1081][[37950,0],1][btl_sm_component.c:322:mca_btl_sm_component_init] No peers to communicate with. Disabling sm.
[fluke108:1597938] select: init of component sm returned failure
[fluke108:1597938] mca: base: close: component sm closed
[fluke108:1597938] mca: base: close: unloading component sm
[fluke108:1597938] select: initializing btl component tcp
[fluke108:1597938] btl:tcp: 0x76d620: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[fluke108:1597938] btl:tcp: Attempting to bind to AF_INET port 1024
[fluke108:1597938] btl:tcp: Attempting to bind to AF_INET port 1025
[fluke108:1597938] btl:tcp: Successfully bound to AF_INET port 1025
[fluke108:1597938] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[fluke108:1597938] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[fluke108:1597938] select: init of component tcp returned success
[fluke108:1597938] select: initializing btl component uct
[fluke1081][[37950,0],1][btl_uct_component.c:493:mca_btl_uct_component_init] initializing uct btl
[fluke1081][[37950,0],1][btl_uct_component.c:498:mca_btl_uct_component_init] no uct memory domains specified
[fluke108:1597938] select: init of component uct returned failure
[fluke108:1597938] mca: base: close: component uct closed
[fluke108:1597938] mca: base: close: unloading component uct
[fluke108:1597938] select: initializing btl component usnic
[fluke108:1597938] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[fluke108:1597938] select: init of component usnic returned failure
[fluke108:1597938] mca: base: close: component usnic closed
[fluke108:1597938] mca: base: close: unloading component usnic
# OSU MPI Latency Test
# Size            Latency (us)
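
The verbose output is long, so a small script can condense it to one result per BTL component per process. This is a rough sketch: the function name btl_init_results is hypothetical, and the regular expression is written only against the "select: init of component ... returned ..." lines as they appear in the log above.

```python
import re
from collections import defaultdict

# Matches lines such as
# "[fluke108:1597939] select: init of component tcp returned success"
INIT = re.compile(
    r"^\[([^\]:]+):(\d+)\] select: init of component (\S+) returned (success|failure)"
)


def btl_init_results(log_text):
    """Map (host, pid) -> {component: 'success' | 'failure'}.

    Hypothetical helper; lines that do not match INIT are ignored.
    """
    results = defaultdict(dict)
    for line in log_text.splitlines():
        m = INIT.match(line)
        if m:
            host, pid, component, outcome = m.groups()
            results[(host, int(pid))][component] = outcome
    return dict(results)


# Excerpt for one process, copied from the log above.
sample = """\
[fluke108:1597939] select: init of component self returned success
[fluke108:1597939] select: init of component ofi returned success
[fluke108:1597939] select: init of component sm returned failure
[fluke108:1597939] select: init of component tcp returned success
[fluke108:1597939] select: init of component uct returned failure
[fluke108:1597939] select: init of component usnic returned failure
"""

summary = btl_init_results(sample)
for proc, components in summary.items():
    ok = sorted(c for c, r in components.items() if r == "success")
    print(proc, "initialized:", ", ".join(ok))
```

Run against the full log, this should show the same result for both processes: self, ofi, and tcp report success, while sm, uct, and usnic report failure.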


garlick commented Sep 30, 2021

It looks like each rank successfully bound to a TCP port:

[fluke108:1597939] btl:tcp: Successfully bound to AF_INET port 1024
[fluke108:1597938] btl:tcp: Successfully bound to AF_INET port 1025

But there are no connect messages. Usually we see (from the tcp btl) something like this:

[system76-pc:885784] btl:tcp: now connected to 127.0.0.1, process [[52714,0],1]
[system76-pc:885783] btl:tcp: now connected to 127.0.0.1, process [[52714,0],0]
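
The same check can be scripted, which may help when comparing logs from several systems. A minimal sketch follows; the function name tcp_bind_connect is hypothetical, and the patterns are based only on the two message forms quoted above.

```python
import re

# "[fluke108:1597939] btl:tcp: Successfully bound to AF_INET port 1024"
BOUND = re.compile(
    r"^\[([^\]:]+):(\d+)\] btl:tcp: Successfully bound to AF_INET port (\d+)"
)
# "[system76-pc:885784] btl:tcp: now connected to 127.0.0.1, process [[52714,0],1]"
CONNECTED = re.compile(r"^\[([^\]:]+):(\d+)\] btl:tcp: now connected to ")


def tcp_bind_connect(log_text):
    """Return (bound, connected) for the tcp btl messages in log_text.

    bound maps (host, pid) -> listening port; connected is the set of
    (host, pid) that logged at least one "now connected" message.
    Hypothetical helper for illustration only.
    """
    bound, connected = {}, set()
    for line in log_text.splitlines():
        m = BOUND.match(line)
        if m:
            bound[(m.group(1), int(m.group(2)))] = int(m.group(3))
            continue
        m = CONNECTED.match(line)
        if m:
            connected.add((m.group(1), int(m.group(2))))
    return bound, connected


# Lines copied from the fluke log above.
fluke_log = """\
[fluke108:1597939] btl:tcp: Successfully bound to AF_INET port 1024
[fluke108:1597938] btl:tcp: Attempting to bind to AF_INET port 1024
[fluke108:1597938] btl:tcp: Successfully bound to AF_INET port 1025
"""

bound, connected = tcp_bind_connect(fluke_log)
never_connected = sorted(p for p in bound if p not in connected)
print("bound but never connected:", never_connected)
```

For the fluke log, both processes end up in never_connected, which matches the observation that the bind succeeds but no connect message follows.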
