
[4.19-9.6]: ostree.sync kola test fails on s390x #1720

Open
marmijo opened this issue Jan 28, 2025 · 10 comments
marmijo commented Jan 28, 2025

On s390x, the ostree.sync kola test is failing on the 4.19-9.6 stream with the following output. The test passed in one recent run, but fails most of the time.

[2025-01-27T16:43:40.293Z] --- FAIL: ostree.sync (604.38s)
[2025-01-27T16:43:40.293Z]         sync.go:201: Got NFS mount.
[2025-01-27T16:43:40.293Z]         sync.go:229: Set link down and rebooting.
[2025-01-27T16:43:40.293Z]         cluster.go:151: Running as unit: run-rd0aaf58e6eb14d4fb456769f3bbb393f.service
[2025-01-27T16:43:40.293Z]         harness.go:106: TIMEOUT[10m0s]: ssh: sudo sh /usr/local/bin/nfs-random-write.sh
[2025-01-27T16:43:40.293Z]         harness.go:106: TIMEOUT[10m0s]: ssh: cat /proc/cmdline
[2025-01-27T16:43:40.293Z] FAIL, output in /home/jenkins/agent/workspace/build-arch/tmp/kola-0nsQt/kola/rerun

The journal log from one of the recent failed s390x build jobs shows:

Jan 27 16:33:36.020474 coreos-teardown-initramfs.service[1276]: info: taking down network device: enc3
Jan 27 16:33:36.021223 coreos-teardown-initramfs.service[1296]: RTNETLINK answers: Operation not supported
Jan 27 16:33:36.023099 coreos-teardown-initramfs.service[1276]: info: flushing all routing
Jan 27 16:33:36.025557 coreos-teardown-initramfs.service[1276]: info: no initramfs hostname information to propagate
Jan 27 16:33:36.027205 coreos-teardown-initramfs.service[1276]: info: no networking config is defined in the real root

journal.txt

console.txt

marmijo added a commit to marmijo/os that referenced this issue Jan 28, 2025
This test is failing intermittently on s390x. Let's snooze it for now
to unblock the pipeline while we investigate:
openshift#1720

marmijo commented Jan 28, 2025

Also, for completeness, this test was recently added to coreos-assembler in: coreos/coreos-assembler#3998

dustymabe pushed a commit that referenced this issue Jan 28, 2025
This test is failing intermittently on s390x. Let's snooze it for now
to unblock the pipeline while we investigate:
#1720
c4rt0 pushed a commit to c4rt0/os that referenced this issue Jan 29, 2025
This test is failing intermittently on s390x. Let's snooze it for now
to unblock the pipeline while we investigate:
openshift#1720
HuijingHei commented

I'll take it. @marmijo, can you sync this to Jira and assign it to me? I don't have permission to do the sync, thanks!


marmijo commented Feb 10, 2025

/jira


jlebon commented Feb 10, 2025

I think you're looking for:
/label jira

@HuijingHei, next time try that yourself too. If you don't have access, we should look into it.


openshift-ci bot commented Feb 10, 2025

@jlebon: The label(s) /label jira cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, no-qe, downstream-change-needed, rebase/manual, cluster-config-api-changed, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

I think you're looking for:
/label jira

@HuijingHei, next time try that yourself too. If you don't have access, we should look into it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

jlebon added the jira label on Feb 10, 2025

jlebon commented Feb 10, 2025

Hmm, I think we still have to wire it up for issues. Anyway, added it manually for now.


HuijingHei commented Feb 19, 2025

Running ostree.sync on rhcos-419.96.202502180531-0 on s390x, the failure rate is 3/5. Checking the failed log (on the test VM), it is stuck at:

A stop job is running for OSTree Fi_ed Deployment (9min 6s / no limit)

I thought it was due to low memory and increased the test VM's memory to 8G (the default is 2G), but no luck.

console.txt
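For reproducing this locally, the runs above can be sketched with kola directly; this is only a sketch, where the qcow2 filename is a placeholder for the s390x build artifact and `--qemu-memory 8192` mirrors the 8G experiment described above:

```shell
# Hypothetical local reproduction; the image path is a placeholder for
# the s390x build artifact, and 8192 MiB mirrors the memory bump above.
kola run --qemu-image ./rhcos-419.96.202502180531-0-qemu.s390x.qcow2 \
    --qemu-memory 8192 ostree.sync
```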


jmarrero commented Feb 20, 2025

I wonder if the changes in ostreedev/ostree#2968 and ostreedev/ostree#2969 act differently on s390x...

Do we see anything like

ot_journal_print (LOG_INFO, "Completed global sync()");

in the journal or anything interesting around that output?

In your console log I see

[    *] A stop job is running for OSTree Fi…d Deployment (7min 36s / no limit)
[  489.660159] INFO: task (sd-sync):2546 blocked for more than 368 seconds.

Not sure if that is the same sync we are calling.
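To check for that message, something like this sketch could be run on the test VM after the reboot (the unit name is the one this thread later points at; `-b -1` looks at the previous boot, where the finalization ran):

```shell
# Search the previous boot's journal for ostree's sync log lines;
# prints a fallback note when nothing matches (or journalctl is absent).
journalctl -b -1 -u ostree-finalize-staged.service --no-pager 2>/dev/null \
    | grep -i 'sync' || echo "no sync messages found"
```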

HuijingHei commented

I wonder if the changes in ostreedev/ostree#2968 and ostreedev/ostree#2969 act differently on s390x...

Do we see anything like

ot_journal_print (LOG_INFO, "Completed global sync()");

in the journal or anything interesting around that output?

No, it seems to hang in ostree-finalize-staged.service; I can't see any other useful journal logs.

In your console log I see

[    *] A stop job is running for OSTree Fi…d Deployment (7min 36s / no limit)
[  489.660159] INFO: task (sd-sync):2546 blocked for more than 368 seconds.

Not sure if that is the same sync we are calling.

[  366.780159] INFO: task dd:2438 blocked for more than 245 seconds.
[  366.780184]       Not tainted 5.14.0-570.el9.s390x #1
[  366.780186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  366.780388] INFO: task dd:2439 blocked for more than 245 seconds.
..
[  366.780460] INFO: task dd:2471 blocked for more than 245 seconds.
..
[  366.780530] INFO: task (sd-sync):2505 blocked for more than 245 seconds.
..
[  366.780562] INFO: task zipl:2546 blocked for more than 245 seconds.

Not sure whether it is blocked on dd, sd-sync, or zipl. As I understand it, if the timeout (DefaultTimeoutStopSec=10s) is reached, systemd will kill all the processes and restart.
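For what it's worth, those "blocked for more than N seconds" lines come from the kernel's hung-task detector; a small sketch using standard kernel interfaces (nothing specific to this test) to dig into which task is actually stuck:

```shell
# Read the hung-task threshold that produces the "blocked for more than
# N seconds" messages; the file is absent on kernels without the detector.
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || echo "not available"
# To dump kernel stacks of every D-state (uninterruptible) task to the
# console/journal, run as root on the affected VM:
# echo w > /proc/sysrq-trigger
```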

dustymabe commented

The fact that we set the NIC down is problematic because IIUC we don't get any more logs from the system after that happens. Can we just drop NFS port traffic instead?

diff --git a/mantle/kola/tests/ostree/sync.go b/mantle/kola/tests/ostree/sync.go
index 0fc36d6f6..077fb6392 100644
--- a/mantle/kola/tests/ostree/sync.go
+++ b/mantle/kola/tests/ostree/sync.go
@@ -217,7 +217,9 @@ storage:
 
 func doSyncTest(c cluster.TestCluster, client platform.Machine) {
        c.RunCmdSync(client, "sudo touch /var/tmp/data3/test")
-       // Continue writing while doing test
+       // I wonder if this would be better if the script itself was just
+       // an infinite loop and gets run by a systemd unit that we can
+       // `systemctl start` here instead of running it in a go func()
        go func() {
                _, err := c.SSH(client, "sudo sh /usr/local/bin/nfs-random-write.sh")
                if err != nil {
@@ -225,18 +227,11 @@ func doSyncTest(c cluster.TestCluster, client platform.Machine) {
                }
        }()
 
-       // Create a stage deploy using kargs while writing
-       c.RunCmdSync(client, "sudo rpm-ostree kargs --append=test=1")
+       // block NFS traffic
+       c.RunCmdSync(client, "sudo iptables <drop NFS port traffic>")
 
-       netdevices := c.MustSSH(client, "ls /sys/class/net | grep -v lo")
-       netdevice := string(netdevices)
-       if netdevice == "" {
-               c.Fatalf("failed to get net device")
-       }
-       c.Log("Set link down and rebooting.")
-       // Skip the error check as it is expected
-       cmd := fmt.Sprintf("sudo systemd-run sh -c 'ip link set %s down && sleep 2 && systemctl reboot'", netdevice)
-       _, _ = c.SSH(client, cmd)
+       // Create a stage deploy using kargs while writing
+       c.RunCmdSync(client, "sudo systemd-run sh -c 'rpm-ostree kargs --append=test=1 --reboot'")
 
        time.Sleep(5 * time.Second)
        err := util.Retry(8, 10*time.Second, func() error {
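The two placeholders in the diff above could look roughly like the following sketch; the port (2049, the registered NFSv4/TCP port) and the transient unit name are assumptions, not tested values:

```shell
# (1) Blackhole NFS traffic instead of downing the NIC, so SSH and
#     console logging keep working (assumes NFSv4 over TCP port 2049):
sudo iptables -A OUTPUT -p tcp --dport 2049 -j DROP
sudo iptables -A INPUT  -p tcp --sport 2049 -j DROP
# (2) Run the writer as a transient systemd unit instead of a go func,
#     so the test can start and stop it explicitly (hypothetical name):
sudo systemd-run --unit=nfs-random-write sh /usr/local/bin/nfs-random-write.sh
# ...and later: sudo systemctl stop nfs-random-write.service
```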
