ci: reduce test & CI flakes with retries + longer waits & timeouts #11753

agilgur5 · 2023-09-06T01:06:26Z

Partial fix for #10807 / #9027

Motivation

signal_test.go was flaking quite a bit in CI
- examples (all same test, same error): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
  - the most frequent test flake I've seen by a good margin (~20x (?) more than any other one, anecdotally)
the rest are incidental, but related, fixes for flakes that occurred in CI runs for this PR itself (and that have been seen in CI runs elsewhere too)

signals_test failure logs for posterity

  stop-terminate-4lf2q : Failed Stopped with strategy 'Stop'
    signals_test.go:44: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:36
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Running"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.NodePhase) (len=9) "Succeeded"
        	            	+(v1alpha1.NodePhase) (len=7) "Running"
        	            	 
        	Test:       	TestSignalsSuite/TestStopBehavior
=== FAIL: SignalsSuite/TestStopBehavior

Modifications

TestStopBehavior in particular was flaking in checking onExit (all examples above are on the exact same line), so bumped that specifically
- did not bump killDuration entirely (as chore: Bump killDuration for signals_test.go to avoid flaky test result #11064 did) as that bumps lots of other waits and makes the whole test suite a good bit longer
  - we may want to reduce this in some places and have more targeted wait times in order to reduce total test time
  - and may want to have a different wait on CI vs. local as well
TestStatusCondition flaked on this PR (see below comment), so bumped its wait as well
make codegen flaked on this PR (see below comment), so fixed that as well
and bumped a timeout for another E2E flake on make wait
- it's timed out a few times, so bump it up one minute to 5m
  - examples: 1, 2

Verification

This is only reproducible on CI unfortunately

Future Work

As mentioned above:

More granular killDuration / wait per test as they increase total test time
Different wait times on CI vs local

Also:

Bump k8s swagger as mentioned below
Authenticate some of the GH requests to get rate limited less? It would be great if GH recognized it was coming from itself and increased limits without needing to re-auth

agilgur5 · 2023-09-06T05:53:35Z

Ya gotta be kidding me, a different test flaked 😭

also looks to be a timeout issue though, but in a completely different E2E test suite

TestStatusCondition failure log for posterity

http-template-condition-68dv5 :  
    agent_test.go:146: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:146
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Not equal: 
        	            	expected: "Failed"
        	            	actual  : ""
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.WorkflowPhase) (len=6) "Failed"
        	            	+(v1alpha1.WorkflowPhase) ""
        	            	 
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:149: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:149
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:1[53](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:54): 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:153
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:1[57](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:58): 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:157
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:1[61](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:62): 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:161
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:[68](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:69)
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
=== FAIL: AgentSuite/TestStatusCondition

- example: https://github.com/argoproj/argo-workflows/actions/runs/6091465134/job/16528122001?pr=11753 - could not find any other recent examples of this? Signed-off-by: Anton Gilgur <[email protected]>

agilgur5 · 2023-09-06T07:00:42Z

Annnd then I checked ~50+ recent E2E test failures and none of them match the new TestStatusCondition flake (see above comment). So, at least by my recent count, this is the first time this specific test has flaked 😕 😐 😭

EDIT: found some more examples of TestStatusCondition failing after this was merged: 1, 2, 3

- it failed on another CI run despite the previous increase: https://github.com/argoproj/argo-workflows/actions/runs/6093861532/job/16534290978?pr=11753 Signed-off-by: Anton Gilgur <[email protected]>

agilgur5 · 2023-09-06T07:37:21Z

adjfklas;dfk, StopBehavior failed again despite the bump... 😐

Gonna bump it once more...

agilgur5 · 2023-09-06T20:07:35Z

so all the tests passed.... and then Codegen failed......... the whack-a-mole is real.... 😵‍💫

this CI error I have seen at least once before though, so at least it's not unique! 😅

Examples: 1, 2

Test logs for posterity

go run ./hack/swagger kubeifyswagger dist/swaggifed.swagger.json dist/kubeified.swagger.json
panic: invalid character '<' looking for beginning of value

goroutine 1 [running]:
main.getKubernetesSwagger()
	/home/runner/work/argo-workflows/argo-workflows/hack/swagger/kubeifyswagger.go:90 +0xe5
main.kubeifySwagger({0x7ffc2e045c59?, 0x1704f78?}, {0x7ffc2e045c75, 0x1b})
	/home/runner/work/argo-workflows/argo-workflows/hack/swagger/kubeifyswagger.go:28 +0x3ca
main.main()
	/home/runner/work/argo-workflows/argo-workflows/hack/swagger/main.go:10 +0x85
exit status 2
make: *** [Makefile:651: dist/kubeified.swagger.json] Error 1

From the test logs, it is failing on make dist/kubeified.swagger.json with this line specifically erroring on decoding k8s Swagger JSON.
It's erroring on a < -- i.e. the curl for this got back HTML instead of JSON. The full HTML is not output but it's likely from a rate limit error or something from GitHub. Some 400+ HTTP status code probably (429 is likely).

The recurl.sh script does not handle curl getting a bad status code, so will need to modify that slightly to account not just for curl failing but for the response being a bad status code as well.

Side note: k8s 1.23.3 swagger is outdated now, should probably be updated

- had some CI issues on `codegen` when trying to retrieve k8s swagger - it tried to decode JSON but instead got some HTML, i.e. the `curl` got a 429 rate limit or something and got an HTML page instead - `curl --fail` has `curl` exit on bad status codes, so this should now retry on a bad status code - examples: https://github.com/argoproj/argo-workflows/actions/runs/6094143833/job/16535230915?pr=11753, https://github.com/argoproj/argo-workflows/actions/runs/6084230658/job/16505949481 Signed-off-by: Anton Gilgur <[email protected]>

- it's timed out a few times, so bump it up one minute to 5m - examples: https://github.com/argoproj/argo-workflows/actions/runs/6101699486/job/16558698083?pr=11766 - can't seem to find another example but I've definitely seen it more than once before Signed-off-by: Anton Gilgur <[email protected]>

agilgur5 · 2023-09-06T22:40:37Z

All green checks has never looked so good 😭 🎉 🚀

agilgur5 · 2023-09-07T16:57:03Z

For reference, my hypothesis is that we're getting some of these test flakes in CI specifically perhaps because the CI runner is bottlenecking on CPU or memory. The standard runner is 2 core CPU + 7 GB RAM.
Given that CI is running DinD with a k3d cluster and parallelized tests, I would be surprised if CPU isn't maxed out and memory might be close to max. I typically give my devcontainer up to 6 CPU cores and 8GB RAM, as an anecdotal comparison.

With max CPU, everything takes longer and I/O is waiting on CPU. There would also be more CPU pre-emption and switching (although I don't quite know how the goroutine implementation handles this, as they are user-land "threads").

With both of those scenarios, it would roughly make sense why CI needs longer waits etc, as the CPU just can't get to everything timely.

Just a hypothesis. Either way, we can't increase the runner size in default OSS. Though I wonder if GitHub might grant an exception to a large, graduated CNCF project.

fix(test): also bump agent StatusCondition wait

fc0f9e6

- example: https://github.com/argoproj/argo-workflows/actions/runs/6091465134/job/16528122001?pr=11753 - could not find any other recent examples of this? Signed-off-by: Anton Gilgur <[email protected]>

agilgur5 changed the title ~~fix(test): bump signals StopBehavior timeout due to flakes~~ fix(test): bump signals StopBehavior wait due to flakes Sep 6, 2023

fix(test): bump StopBehavior again

588c625

- it failed on another CI run despite the previous increase: https://github.com/argoproj/argo-workflows/actions/runs/6093861532/job/16534290978?pr=11753 Signed-off-by: Anton Gilgur <[email protected]>

agilgur5 added the area/build Build or GithubAction/CI issues label Sep 6, 2023

agilgur5 added 2 commits September 6, 2023 16:27

agilgur5 changed the title ~~fix(test): bump signals StopBehavior wait due to flakes~~ fix: reduce test & CI flakes with retries + longer waits & timeouts Sep 6, 2023

agilgur5 requested a review from terrytangyuan September 6, 2023 22:40

jmeridth approved these changes Sep 7, 2023

View reviewed changes

terrytangyuan approved these changes Sep 7, 2023

View reviewed changes

terrytangyuan changed the title ~~fix: reduce test & CI flakes with retries + longer waits & timeouts~~ ci: reduce test & CI flakes with retries + longer waits & timeouts Sep 7, 2023

terrytangyuan enabled auto-merge (squash) September 7, 2023 01:22

terrytangyuan merged commit 1f2e822 into argoproj:master Sep 7, 2023
29 checks passed

agilgur5 deleted the fix-ci-bump-signals-test branch September 7, 2023 01:51

This was referenced Sep 11, 2023

refactor(ui): migrate UserInfo to functional component #11793

Merged

docs: remove no longer needed spelling exceptions #11814

Merged

agilgur5 mentioned this pull request Sep 22, 2023

REQUEST: Promotion to Reviewer for @agilgur5 argoproj/argoproj#242

Closed

8 tasks

toyamagu-2021 mentioned this pull request Jan 4, 2024

fix: ensure workflow wait for onExit hook for DAG template. Fixes #11880 #12436

Merged

agilgur5 mentioned this pull request May 12, 2024

fix(test): add a sleep to TestLoadToStream #13037

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: reduce test & CI flakes with retries + longer waits & timeouts #11753

ci: reduce test & CI flakes with retries + longer waits & timeouts #11753

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 6, 2023

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 6, 2023

agilgur5 commented Sep 7, 2023 •

edited

Loading

ci: reduce test & CI flakes with retries + longer waits & timeouts #11753

ci: reduce test & CI flakes with retries + longer waits & timeouts #11753

Conversation

agilgur5 commented Sep 6, 2023 • edited Loading

Motivation

Modifications

Verification

Future Work

agilgur5 commented Sep 6, 2023 • edited Loading

agilgur5 commented Sep 6, 2023 • edited Loading

agilgur5 commented Sep 6, 2023

agilgur5 commented Sep 6, 2023 • edited Loading

agilgur5 commented Sep 6, 2023

agilgur5 commented Sep 7, 2023 • edited Loading

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 6, 2023 •

edited

Loading

agilgur5 commented Sep 7, 2023 •

edited

Loading