ci: reduce test & CI flakes with retries + longer waits & timeouts #11753
Conversation
- `signal_test.go` was flaking quite a bit in CI
- `TestStopBehavior` in particular was flaking in checking onExit, so bumped that specifically
  - examples: https://github.com/argoproj/argo-workflows/actions/runs/6090397163/job/16525212870?pr=11752, https://github.com/argoproj/argo-workflows/actions/runs/6090397163/job/16525214144?pr=11752, https://github.com/argoproj/argo-workflows/actions/runs/6057348108/job/16438451395?pr=11727, https://github.com/argoproj/argo-workflows/actions/runs/6024154407/job/16342412510?pr=11716, https://github.com/argoproj/argo-workflows/actions/runs/5910880044/job/16032931699?pr=11568, https://github.com/argoproj/argo-workflows/actions/runs/5910880485/job/16032933770?pr=11567, https://github.com/argoproj/argo-workflows/actions/runs/6066706542/job/16457612480, https://github.com/argoproj/argo-workflows/actions/runs/6008824810/job/16297128151, https://github.com/argoproj/argo-workflows/actions/runs/6076177904/job/16483849857
- did not bump `killDuration` entirely (as 01d8cff did), as that bumps lots of other waits and makes the whole test suite a good bit longer
  - we may want to reduce this in some places and have more targeted wait times in order to reduce total test time
  - and may want to have a different timeout on CI vs. local as well

Signed-off-by: Anton Gilgur <[email protected]>
Ya gotta be kidding me, a different test flaked 😭 also looks to be a timeout issue though, but in a completely different E2E test suite
- example: https://github.com/argoproj/argo-workflows/actions/runs/6091465134/job/16528122001?pr=11753
- could not find any other recent examples of this?

Signed-off-by: Anton Gilgur <[email protected]>
bump `StopBehavior` timeout due to flakes
bump `StopBehavior` wait due to flakes

EDIT: found some more examples of this
- it failed on another CI run despite the previous increase: https://github.com/argoproj/argo-workflows/actions/runs/6093861532/job/16534290978?pr=11753 Signed-off-by: Anton Gilgur <[email protected]>
adjfklas;dfk, Gonna bump it once more...
so all the tests passed.... and then Codegen failed......... the whack-a-mole is real.... 😵‍💫 this CI error I have seen at least once before though, so at least it's not unique! 😅

Test logs for posterity
From the test logs, it is failing on the k8s swagger retrieval. Side note: the k8s 1.23.3 swagger is outdated now and should probably be updated.
- had some CI issues on `codegen` when trying to retrieve the k8s swagger
- it tried to decode JSON but instead got some HTML, i.e. the `curl` got a 429 rate limit or something and got an HTML page instead
- `curl --fail` has `curl` exit on bad status codes, so this should now retry on a bad status code
- examples: https://github.com/argoproj/argo-workflows/actions/runs/6094143833/job/16535230915?pr=11753, https://github.com/argoproj/argo-workflows/actions/runs/6084230658/job/16505949481

Signed-off-by: Anton Gilgur <[email protected]>
- it's timed out a few times, so bump it up one minute to 5m
- examples: https://github.com/argoproj/argo-workflows/actions/runs/6101699486/job/16558698083?pr=11766
- can't seem to find another example, but I've definitely seen it more than once before

Signed-off-by: Anton Gilgur <[email protected]>
bump `StopBehavior` wait due to flakes
All green checks have never looked so good 😭 🎉 🚀
For reference, my hypothesis is that we're getting some of these test flakes in CI specifically because the CI runner is bottlenecking on CPU or memory. The standard runner is a 2-core CPU + 7 GB RAM. With maxed-out CPU, everything takes longer and I/O is waiting on CPU. There would also be more CPU pre-emption and context switching (although I don't quite know how the goroutine implementation handles this, as goroutines are user-land "threads"). In both of those scenarios, it would roughly make sense why CI needs longer waits etc., as the CPU just can't get to everything in a timely manner. Just a hypothesis. Either way, we can't increase the runner size in default OSS. Though I wonder if GitHub might grant an exception to a large, graduated CNCF project.
Partial fix for #10807 / #9027
Motivation
- `signal_test.go` was flaking quite a bit in CI
- the rest are incidental, but related, fixes for flakes that occurred in CI runs for this PR itself (and that have been seen in CI runs elsewhere too)

(`signals_test` failure logs for posterity)

Modifications
- `TestStopBehavior` in particular was flaking in checking onExit (all examples above are on the exact same line), so bumped that specifically
- did not bump `killDuration` entirely (as chore: Bump killDuration for signals_test.go to avoid flaky test result #11064 did), as that bumps lots of other waits and makes the whole test suite a good bit longer
- `TestStatusCondition` flaked on this PR (see below comment), so bumped its wait as well
- `make codegen` flaked on this PR (see below comment), so fixed that as well
- and bumped a timeout for another E2E flake on `make wait`
Verification
This is only reproducible on CI, unfortunately.
Future Work
As mentioned above:
- we may want to reduce `killDuration` / the wait per test, as they increase total test time
- we may want a different timeout on CI vs. local