Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: reduce test & CI flakes with retries + longer waits & timeouts #11753

Merged
merged 5 commits into from
Sep 7, 2023

Conversation

agilgur5
Copy link
Contributor

@agilgur5 agilgur5 commented Sep 6, 2023

Partial fix for #10807 / #9027

Motivation

  • signal_test.go was flaking quite a bit in CI

    • examples (all same test, same error): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
      • the most frequent test flake I've seen by a good margin (~20x (?) more than any other one, anecdotally)
  • the rest are incidental, but related, fixes for flakes that occurred in CI runs for this PR itself (and that have been seen in CI runs elsewhere too)

signals_test failure logs for posterity
  stop-terminate-4lf2q : Failed Stopped with strategy 'Stop'
    signals_test.go:44: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:36
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Running"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.NodePhase) (len=9) "Succeeded"
        	            	+(v1alpha1.NodePhase) (len=7) "Running"
        	            	 
        	Test:       	TestSignalsSuite/TestStopBehavior
=== FAIL: SignalsSuite/TestStopBehavior

Modifications

  • TestStopBehavior in particular was flaking in checking onExit (all examples above are on the exact same line), so bumped that specifically

  • TestStatusCondition flaked on this PR (see below comment), so bumped its wait as well

  • make codegen flaked on this PR (see below comment), so fixed that as well

  • and bumped a timeout for another E2E flake on make wait

    • it's timed out a few times, so bump it up one minute to 5m
      • examples: 1, 2

Verification

This is only reproducible on CI unfortunately

Future Work

As mentioned above:

  1. More granular killDuration / wait per test as they increase total test time
  2. Different wait times on CI vs local

Also:

  1. Bump k8s swagger as mentioned below
  2. Authenticate some of the GH requests to get rate limited less? It would be great if GH recognized it was coming from itself and increased limits without needing to re-auth

@agilgur5
Copy link
Contributor Author

agilgur5 commented Sep 6, 2023

Ya gotta be kidding me, a different test flaked 😭

also looks to be a timeout issue though, but in a completely different E2E test suite

TestStatusCondition failure log for posterity
http-template-condition-68dv5 :  
    agent_test.go:146: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:146
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Not equal: 
        	            	expected: "Failed"
        	            	actual  : ""
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.WorkflowPhase) (len=6) "Failed"
        	            	+(v1alpha1.WorkflowPhase) ""
        	            	 
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:149: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:149
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:1[53](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:54): 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:153
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:1[57](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:58): 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:157
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
    agent_test.go:1[61](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:62): 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:161
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:[68](https://github.com/argoproj/argo-workflows/actions/runs/6081856023/job/16498450728#step:16:69)
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/agent_test.go:145
        	Error:      	Expected value not to be nil.
        	Test:       	TestAgentSuite/TestStatusCondition
=== FAIL: AgentSuite/TestStatusCondition

@agilgur5 agilgur5 changed the title fix(test): bump signals StopBehavior timeout due to flakes fix(test): bump signals StopBehavior wait due to flakes Sep 6, 2023
@agilgur5
Copy link
Contributor Author

agilgur5 commented Sep 6, 2023

Annnd then I checked ~50+ recent E2E test failures and none of them match the new TestStatusCondition flake (see above comment). So, at least by my recent count, this is the first time this specific test has flaked 😕 😐 😭

EDIT: found some more examples of TestStatusCondition failing after this was merged: 1, 2, 3

@agilgur5
Copy link
Contributor Author

agilgur5 commented Sep 6, 2023

adjfklas;dfk, StopBehavior failed again despite the bump... 😐

Gonna bump it once more...

@agilgur5 agilgur5 added the area/build Build or GithubAction/CI issues label Sep 6, 2023
@agilgur5
Copy link
Contributor Author

agilgur5 commented Sep 6, 2023

so all the tests passed.... and then Codegen failed......... the whack-a-mole is real.... 😵‍💫

this CI error I have seen at least once before though, so at least it's not unique! 😅

Examples: 1, 2

Test logs for posterity
go run ./hack/swagger kubeifyswagger dist/swaggifed.swagger.json dist/kubeified.swagger.json
panic: invalid character '<' looking for beginning of value

goroutine 1 [running]:
main.getKubernetesSwagger()
	/home/runner/work/argo-workflows/argo-workflows/hack/swagger/kubeifyswagger.go:90 +0xe5
main.kubeifySwagger({0x7ffc2e045c59?, 0x1704f78?}, {0x7ffc2e045c75, 0x1b})
	/home/runner/work/argo-workflows/argo-workflows/hack/swagger/kubeifyswagger.go:28 +0x3ca
main.main()
	/home/runner/work/argo-workflows/argo-workflows/hack/swagger/main.go:10 +0x85
exit status 2
make: *** [Makefile:651: dist/kubeified.swagger.json] Error 1

From the test logs, it is failing on make dist/kubeified.swagger.json with this line specifically erroring on decoding k8s Swagger JSON.
It's erroring on a < -- i.e. the curl for this got back HTML instead of JSON. The full HTML is not output but it's likely from a rate limit error or something from GitHub. Some 400+ HTTP status code probably (429 is likely).

The recurl.sh script does not handle curl getting a bad status code, so will need to modify that slightly to account not just for curl failing but for the response being a bad status code as well.

Side note: k8s 1.23.3 swagger is outdated now, should probably be updated

- had some CI issues on `codegen` when trying to retrieve k8s swagger
  - it tried to decode JSON but instead got some HTML, i.e. the `curl` got a 429 rate limit or something and got an HTML page instead
    - `curl --fail` has `curl` exit on bad status codes, so this should now retry on a bad status code
  - examples: https://github.com/argoproj/argo-workflows/actions/runs/6094143833/job/16535230915?pr=11753, https://github.com/argoproj/argo-workflows/actions/runs/6084230658/job/16505949481

Signed-off-by: Anton Gilgur <[email protected]>
- it's timed out a few times, so bump it up one minute to 5m
  - examples: https://github.com/argoproj/argo-workflows/actions/runs/6101699486/job/16558698083?pr=11766
    - can't seem to find another example but I've definitely seen it more than once before

Signed-off-by: Anton Gilgur <[email protected]>
@agilgur5 agilgur5 changed the title fix(test): bump signals StopBehavior wait due to flakes fix: reduce test & CI flakes with retries + longer waits & timeouts Sep 6, 2023
@agilgur5
Copy link
Contributor Author

agilgur5 commented Sep 6, 2023

All green checks has never looked so good 😭 🎉 🚀

@terrytangyuan terrytangyuan changed the title fix: reduce test & CI flakes with retries + longer waits & timeouts ci: reduce test & CI flakes with retries + longer waits & timeouts Sep 7, 2023
@terrytangyuan terrytangyuan enabled auto-merge (squash) September 7, 2023 01:22
@terrytangyuan terrytangyuan merged commit 1f2e822 into argoproj:master Sep 7, 2023
29 checks passed
@agilgur5 agilgur5 deleted the fix-ci-bump-signals-test branch September 7, 2023 01:51
@agilgur5
Copy link
Contributor Author

agilgur5 commented Sep 7, 2023

For reference, my hypothesis is that we're getting some of these test flakes in CI specifically perhaps because the CI runner is bottlenecking on CPU or memory. The standard runner is 2 core CPU + 7 GB RAM.
Given that CI is running DinD with a k3d cluster and parallelized tests, I would be surprised if CPU isn't maxed out and memory might be close to max. I typically give my devcontainer up to 6 CPU cores and 8GB RAM, as an anecdotal comparison.

With max CPU, everything takes longer and I/O is waiting on CPU. There would also be more CPU pre-emption and switching (although I don't quite know how the goroutine implementation handles this, as they are user-land "threads").

With both of those scenarios, it would roughly make sense why CI needs longer waits etc, as the CPU just can't get to everything timely.

Just a hypothesis. Either way, we can't increase the runner size in default OSS. Though I wonder if GitHub might grant an exception to a large, graduated CNCF project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/build Build or GithubAction/CI issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants