
Migrate CRI-O jobs away from kubernetes_e2e.py #32567

Open
saschagrunert opened this issue May 6, 2024 · 49 comments
Labels
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@saschagrunert
Member

saschagrunert commented May 6, 2024

The kubernetes_e2e.py script is deprecated and we should use kubetest2 instead.

All affected tests are listed in https://testgrid.k8s.io/sig-node-cri-o

cc @kubernetes/sig-node-cri-o-test-maintainers

Ref: https://github.com/kubernetes/test-infra/tree/master/scenarios, #20760

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 6, 2024
@haircommander
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 4, 2024
@saschagrunert
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2024
@kannon92
Contributor

/triage accepted
/priority important-longterm

@kannon92 kannon92 moved this from Triage to Issues - To do in SIG Node CI/Test Board Aug 21, 2024
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 21, 2024
@elieser1101
Contributor

Does this still need help? Can I start looking at it?

@saschagrunert
Member Author

@elieser1101 I'd appreciate your eyes on that. 🙏

@elieser1101
Contributor

/assign

@elieser1101
Contributor

elieser1101 commented Jan 3, 2025

> So, kubetest2 changes --timeout 300m to ginkgo's --timeout=180m for some reason. Do you have any idea why?

I have seen that before, but I can't point to exactly why; I think it is more of a test-e2e-node.sh and e2e_node/remote/remote.go thing.

> Isn't it this one?
> https://github.com/kubernetes-sigs/kubetest2/blob/22d5b1410bef09ae679fa5813a5f0d196b6079de/pkg/testers/node/node.go#L73

Yeah, that is the flag we are using (tester flags), but then under the hood the rabbit hole transforms the timeout in several places.

When we pass --timeout=300m to kubetest2 we get this:

Running the command ssh, with args: [-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -i /root/.ssh/google_compute_engine [email protected] -- sudo /bin/bash -c 'cd /tmp/node-e2e-20250103T183438 && set -o pipefail; timeout -k 30s 18000.000000s ./ginkgo -timeout=24h -focus="\[NodeFeature:Eviction\]"  -skip=""""  --no-color -v --timeout=180m ./e2e_node.test -- --system-spec-name= --system-spec-file= --extra-envs= --runtime-config= --v 4 --node-name=test-fedora-coreos-41-20241122-3-0-gcp-x86-64 --report-dir=/tmp/node-e2e-20250103T183438/results --report-prefix=fedora --image-description="fedora-coreos-41-20241122-3-0-gcp-x86-64" --kubelet-flags="--cluster-domain=cluster.local" --dns-domain="cluster.local" --prepull-images=false  --container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}" 2>&1 | tee -i /tmp/node-e2e-20250103T183438/results/test-fedora-coreos-41-20241122-3-0-gcp-x86-64-ginkgo.log']
  • Which results in a process timeout of 18000.000000s
  • test-e2e-node.sh also introduces -timeout=24h no matter what other timeout you pass
  • And finally the timeout we specified, but trimmed by remote.go, resulting in --timeout=180m

So setting 300min -> (300 + 60) / 2 = 180min is what gets passed to ginkgo.
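
In pseudocode, that trimming works out to something like this (a minimal sketch assuming only the arithmetic above; the helper name is made up and this is not the actual test-e2e-node.sh / remote.go code):

```go
package main

import (
	"fmt"
	"time"
)

// ginkgoTimeout sketches the trimming described above: the --timeout that
// finally reaches ginkgo comes out to roughly (requested timeout + 1h) / 2.
// Illustrative only, not the real remote.go implementation.
func ginkgoTimeout(requested time.Duration) time.Duration {
	return (requested + time.Hour) / 2
}

func main() {
	// Passing --timeout=300m to kubetest2 ...
	fmt.Println(ginkgoTimeout(300 * time.Minute)) // ... prints 3h0m0s, i.e. the observed --timeout=180m
}
```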

@bart0sh
Contributor

bart0sh commented Jan 3, 2025

I hope that timeout recalculation has some reason behind it. It's not obvious, but hopefully it exists :)

BTW, increasing the timeout helped the job, but didn't fix it. One test case still fails.

@kannon92 @elieser1101 Any ideas how to fix it?

@elieser1101
Contributor

Is it possible the test itself is flaky? I can see that the non-kubetest2 job works intermittently, and I also found one run with an error similar to the one in the job running with kubetest2.

@bart0sh

@kannon92
Contributor

kannon92 commented Jan 6, 2025

Eviction CRI-O tests have some issues. I wouldn't worry about that.

@bart0sh
Contributor

bart0sh commented Jan 7, 2025

> Is it possible the test itself is flaky?

Could be, but I've never managed to run the -kubetest2 tests without a failure. The non-kubetest2 tests are almost always green.

> Eviction CRI-O tests have some issues.

It's probably off-topic here, so feel free to ignore.
I've noticed unexpectedly long timeouts in the e2e eviction test cases. Is it considered normal for eviction to start 10 minutes after the issue (disk/PID pressure) started to manifest itself?

$ grep 'pressureTimeout :=' test/e2e_node/eviction_test.go 
        pressureTimeout := 15 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 15 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 15 * time.Minute
        pressureTimeout := 10 * time.Minute
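
For context, this is roughly how such a budget plays out in a wait loop (a simplified sketch, not the actual eviction_test.go code; the function name and poll interval below are invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForPressure is an illustrative stand-in for the test's wait: it polls
// until the node reports the expected pressure condition, giving up only
// after the full pressureTimeout has elapsed. With a 10 to 15 minute budget,
// a run where eviction takes several minutes to kick in still passes.
func waitForPressure(hasPressure func() bool, pressureTimeout time.Duration) error {
	deadline := time.Now().Add(pressureTimeout)
	for time.Now().Before(deadline) {
		if hasPressure() {
			return nil
		}
		time.Sleep(10 * time.Second) // poll interval, also invented for this sketch
	}
	return errors.New("timed out waiting for node pressure condition")
}

func main() {
	// Example with a check that reports pressure immediately, so the call returns at once.
	err := waitForPressure(func() bool { return true }, 10*time.Minute)
	fmt.Println(err) // <nil>
}
```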

@elieser1101
Contributor

elieser1101 commented Jan 16, 2025

Opened PR #34164 to promote the kubetest2 jobs that have been consistently working. The ones still pending rework are:

  • Evented PLEG (where the non-kubetest2 job does not seem to be working either)
  • Hugepages
  • Eviction
  • Resource managers

@SergeyKanzhelev
Member

Discussed this at the node CI meeting. @elieser1101 curious if you are still working on this?

@elieser1101
Contributor

I've been doing other stuff around sig-release, but I can reprioritize this and spend some hours to move it forward; I have a couple of things ready related to this.
Did anything change about this issue during the discussion? @SergeyKanzhelev

@SergeyKanzhelev
Member

Nothing specific. The question was basically whether this is still in progress or a new owner is needed. Slow progress is OK.

@bart0sh
Contributor

bart0sh commented Feb 20, 2025

@elieser1101 @kannon92 @ffromani @swatisehgal

> pr-crio-cgroupv1-node-e2e-resource-managers-kubetest2 green but seem to skip everything
> pr-crio-cgroupv2-node-e2e-resource-managers-kubetest2

They haven't been green since Feb 13, 2025. Is there an issue about it?

@swatisehgal
Contributor

swatisehgal commented Feb 20, 2025

> @elieser1101 @kannon92 @ffromani @swatisehgal
>
> pr-crio-cgroupv1-node-e2e-resource-managers-kubetest2 green but seem to skip everything
> pr-crio-cgroupv2-node-e2e-resource-managers-kubetest2
>
> They haven't been green since Feb 13, 2025. Is there an issue about it?

We have had failures since kubernetes/kubernetes#127525 was merged. We have a tracking issue for this (kubernetes/kubernetes#130146) and a fix in place (kubernetes/kubernetes#130163), which is being reviewed.

@elieser1101
Contributor

Once the Evented PLEG job is in place, the only missing presubmits are the eviction ones.

@bart0sh
Contributor

bart0sh commented Feb 21, 2025

@elieser1101 Thanks for the reminder. I was distracted from the eviction job investigation by other tasks. Now I'm back to it.

Projects
Status: Issues - In progress
Development

No branches or pull requests