
[4.19-9.6]: ostree.sync kola test fails on s390x #1720

Open
marmijo opened this issue Jan 28, 2025 · 10 comments
marmijo commented Jan 28, 2025

On s390x, the ostree.sync kola test is failing on the 4.19-9.6 stream with the following output. The test passed in one recent run, but fails most of the time.

[2025-01-27T16:43:40.293Z] --- FAIL: ostree.sync (604.38s)
[2025-01-27T16:43:40.293Z]         sync.go:201: Got NFS mount.
[2025-01-27T16:43:40.293Z]         sync.go:229: Set link down and rebooting.
[2025-01-27T16:43:40.293Z]         cluster.go:151: Running as unit: run-rd0aaf58e6eb14d4fb456769f3bbb393f.service
[2025-01-27T16:43:40.293Z]         harness.go:106: TIMEOUT[10m0s]: ssh: sudo sh /usr/local/bin/nfs-random-write.sh
[2025-01-27T16:43:40.293Z]         harness.go:106: TIMEOUT[10m0s]: ssh: cat /proc/cmdline
[2025-01-27T16:43:40.293Z] FAIL, output in /home/jenkins/agent/workspace/build-arch/tmp/kola-0nsQt/kola/rerun

The journal log from one of the recent failed s390x build jobs shows:

Jan 27 16:33:36.020474 coreos-teardown-initramfs.service[1276]: info: taking down network device: enc3
Jan 27 16:33:36.021223 coreos-teardown-initramfs.service[1296]: RTNETLINK answers: Operation not supported
Jan 27 16:33:36.023099 coreos-teardown-initramfs.service[1276]: info: flushing all routing
Jan 27 16:33:36.025557 coreos-teardown-initramfs.service[1276]: info: no initramfs hostname information to propagate
Jan 27 16:33:36.027205 coreos-teardown-initramfs.service[1276]: info: no networking config is defined in the real root

journal.txt

console.txt

marmijo added a commit to marmijo/os that referenced this issue Jan 28, 2025
This test is failing intermittently on s390x. Let's snooze it for now
to unblock the pipeline while we investigate:
openshift#1720

marmijo commented Jan 28, 2025

Also, for completeness, this test was recently added to coreos-assembler in: coreos/coreos-assembler#3998

dustymabe pushed a commit that referenced this issue Jan 28, 2025
This test is failing intermittently on s390x. Let's snooze it for now
to unblock the pipeline while we investigate:
#1720
c4rt0 pushed a commit to c4rt0/os that referenced this issue Jan 29, 2025
This test is failing intermittently on s390x. Let's snooze it for now
to unblock the pipeline while we investigate:
openshift#1720
HuijingHei commented

I'll take it. @marmijo, can you sync this to Jira and assign it to me? I don't have permission to do the sync, thanks!


marmijo commented Feb 10, 2025

/jira


jlebon commented Feb 10, 2025

I think you're looking for:
/label jira

@HuijingHei, next time try that yourself too. If you don't have access, we should look into it.


openshift-ci bot commented Feb 10, 2025

@jlebon: The label(s) /label jira cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, no-qe, downstream-change-needed, rebase/manual, cluster-config-api-changed, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

I think you're looking for:
/label jira

@HuijingHei, next time try that yourself too. If you don't have access, we should look into it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

jlebon added the jira label on Feb 10, 2025

jlebon commented Feb 10, 2025

Hmm, I think we still have to wire it up for issues. Anyway, added it manually for now.


HuijingHei commented Feb 19, 2025

Running ostree.sync on rhcos-419.96.202502180531-0 on s390x, the failure rate is 3/5. Checking the failed log (on the test VM), it is stuck at:

A stop job is running for OSTree Fi_ed Deployment (9min 6s / no limit)

I thought it was due to low memory and increased the test VM's memory to 8G (the default is 2G), but no luck.

console.txt
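For reproducing this locally, the runs above can be sketched with kola directly; this is only a sketch, where the qcow2 filename is a placeholder for the s390x build artifact and `--qemu-memory 8192` mirrors the 8G experiment described above:

```shell
# Hypothetical local reproduction; the image path is a placeholder for
# the s390x build artifact, and 8192 MiB mirrors the memory bump above.
kola run --qemu-image ./rhcos-419.96.202502180531-0-qemu.s390x.qcow2 \
    --qemu-memory 8192 ostree.sync
```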


jmarrero commented Feb 20, 2025

I wonder if the changes in ostreedev/ostree#2968 and ostreedev/ostree#2969 act differently on s390x...

Do we see anything like

ot_journal_print (LOG_INFO, "Completed global sync()");

in the journal or anything interesting around that output?

In your console log I see

[    *] A stop job is running for OSTree Fi…d Deployment (7min 36s / no limit)
[  489.660159] INFO: task (sd-sync):2546 blocked for more than 368 seconds.

Not sure if that is the same sync we are calling.
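To check for that message, something like this sketch could be run on the test VM after the reboot (the unit name is the one this thread later points at; `-b -1` looks at the previous boot, where the finalization ran):

```shell
# Search the previous boot's journal for ostree's sync log lines;
# prints a fallback note when nothing matches (or journalctl is absent).
journalctl -b -1 -u ostree-finalize-staged.service --no-pager 2>/dev/null \
    | grep -i 'sync' || echo "no sync messages found"
```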

HuijingHei commented

I wonder if the changes in ostreedev/ostree#2968 and ostreedev/ostree#2969 act differently on s390x...

Do we see anything like

ot_journal_print (LOG_INFO, "Completed global sync()");

in the journal or anything interesting around that output?

No, it seems to hang in ostree-finalize-staged.service; I can't see any other useful journal logs.

In your console log I see

[    *] A stop job is running for OSTree Fi…d Deployment (7min 36s / no limit)
[  489.660159] INFO: task (sd-sync):2546 blocked for more than 368 seconds.

Not sure if that is the same sync we are calling.

[  366.780159] INFO: task dd:2438 blocked for more than 245 seconds.
[  366.780184]       Not tainted 5.14.0-570.el9.s390x #1
[  366.780186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  366.780388] INFO: task dd:2439 blocked for more than 245 seconds.
..
[  366.780460] INFO: task dd:2471 blocked for more than 245 seconds.
..
[  366.780530] INFO: task (sd-sync):2505 blocked for more than 245 seconds.
..
[  366.780562] INFO: task zipl:2546 blocked for more than 245 seconds.

Not sure whether it is blocked on dd, sd-sync, or zipl. As I understand it, if the timeout (DefaultTimeoutStopSec=10s) is reached, systemd will kill all the processes and restart.
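For what it's worth, those "blocked for more than N seconds" lines come from the kernel's hung-task detector; a small sketch using standard kernel interfaces (nothing specific to this test) to dig into which task is actually stuck:

```shell
# Read the hung-task threshold that produces the "blocked for more than
# N seconds" messages; the file is absent on kernels without the detector.
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || echo "not available"
# To dump kernel stacks of every D-state (uninterruptible) task to the
# console/journal, run as root on the affected VM:
# echo w > /proc/sysrq-trigger
```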

dustymabe commented

The fact that we set the NIC down is problematic because IIUC we don't get any more logs from the system after that happens. Can we just drop NFS port traffic instead?

diff --git a/mantle/kola/tests/ostree/sync.go b/mantle/kola/tests/ostree/sync.go
index 0fc36d6f6..077fb6392 100644
--- a/mantle/kola/tests/ostree/sync.go
+++ b/mantle/kola/tests/ostree/sync.go
@@ -217,7 +217,9 @@ storage:
 
 func doSyncTest(c cluster.TestCluster, client platform.Machine) {
        c.RunCmdSync(client, "sudo touch /var/tmp/data3/test")
-       // Continue writing while doing test
+       // I wonder if this would be better if the script itself was just
+       // an infinite loop and gets run by a systemd unit that we can
+       // `systemctl start` here instead of running it in a go func()
        go func() {
                _, err := c.SSH(client, "sudo sh /usr/local/bin/nfs-random-write.sh")
                if err != nil {
@@ -225,18 +227,11 @@ func doSyncTest(c cluster.TestCluster, client platform.Machine) {
                }
        }()
 
-       // Create a stage deploy using kargs while writing
-       c.RunCmdSync(client, "sudo rpm-ostree kargs --append=test=1")
+       // block NFS traffic
+       c.RunCmdSync(client, "sudo iptables <drop NFS port traffic>")
 
-       netdevices := c.MustSSH(client, "ls /sys/class/net | grep -v lo")
-       netdevice := string(netdevices)
-       if netdevice == "" {
-               c.Fatalf("failed to get net device")
-       }
-       c.Log("Set link down and rebooting.")
-       // Skip the error check as it is expected
-       cmd := fmt.Sprintf("sudo systemd-run sh -c 'ip link set %s down && sleep 2 && systemctl reboot'", netdevice)
-       _, _ = c.SSH(client, cmd)
+       // Create a stage deploy using kargs while writing
+       c.RunCmdSync(client, "sudo systemd-run sh -c 'rpm-ostree kargs --append=test=1 --reboot'")
 
        time.Sleep(5 * time.Second)
        err := util.Retry(8, 10*time.Second, func() error {
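The two placeholders in the diff above could look roughly like the following sketch; the port (2049, the registered NFSv4/TCP port) and the transient unit name are assumptions, not tested values:

```shell
# (1) Blackhole NFS traffic instead of downing the NIC, so SSH and
#     console logging keep working (assumes NFSv4 over TCP port 2049):
sudo iptables -A OUTPUT -p tcp --dport 2049 -j DROP
sudo iptables -A INPUT  -p tcp --sport 2049 -j DROP
# (2) Run the writer as a transient systemd unit instead of a go func,
#     so the test can start and stop it explicitly (hypothetical name):
sudo systemd-run --unit=nfs-random-write sh /usr/local/bin/nfs-random-write.sh
# ...and later: sudo systemctl stop nfs-random-write.service
```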
