Flakey tests #10807

terrytangyuan · 2023-04-03T14:04:07Z

Pre-requisites

I have double-checked my configuration
I can confirm the issues exists when I tested with :latest
I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Unit tests failed: https://github.com/argoproj/argo-workflows/actions/runs/4592799925/jobs/8110154390

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

See recent CI builds

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

The text was updated successfully, but these errors were encountered:

tico24 · 2023-04-04T06:43:44Z

This is caused by #10768 - Until kit is fixed or reverted out, this will continue to happen.

From your unit test logs:

curl -q https://raw.githubusercontent.com/kitproj/kit/main/install.sh | sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   510  100   510    0     0    990      0 --:--:-- --:--:-- --:--:--   992
+ curl --retry 99 -vfsL https://api.github.com/repos/kitproj/kit/releases/latest -o /tmp/latest
*   Trying 192.30.255.116:443...
* Connected to api.github.com (192.30.255.116) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
...
< date: Mon, 03 Apr 2023 04:15:11 GMT
< server: Varnish
< strict-transport-security: max-age=31536000; includeSubdomains; preload
< x-content-type-options: nosniff
< x-frame-options: deny
< x-xss-protection: 1; mode=block
< content-security-policy: default-src 'none'; style-src 'unsafe-inline'
< access-control-allow-origin: *
< access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
< content-type: application/json; charset=utf-8
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< x-github-media-type: github.v3; format=json
< x-ratelimit-limit: 60
< x-ratelimit-remaining: 0
< x-ratelimit-reset: 1680495768
< x-ratelimit-resource: core
< x-ratelimit-used: 60
< content-length: 278
< x-github-request-id: F380:89D7:380A122:3A25002:642A52CF
* The requested URL returned error: 403
* stopped the pause stream!
* Connection #0 to host api.github.com left intact
make: *** [Makefile:464: kit] Error 22
Error: Process completed with exit code 2.

terrytangyuan · 2023-04-04T12:53:21Z

Here's the log: https://pipelines.actions.githubusercontent.com/serviceHosts/49efa180-38ba-4f73-8389-f407aa841894/_apis/pipelines/1/runs/34951/signedlogcontent/2?urlExpires=2023-04-04T12%3A52%3A27.4837908Z&urlSigningMethod=HMACV1&urlSignature=7YLqPxOPpolP4cnkttw6DW97ZgvYY6ui%2Fs438NRwSMY%3D

https://github.com/argoproj/argo-workflows/actions/runs/4592799925/jobs/8110154390

It's not related to kit. Are you looking at somewhere else?

tico24 · 2023-04-04T12:54:21Z

I was looking at the failed e2e test, apologies.

terrytangyuan · 2023-04-04T13:28:06Z

chore: Bump killDuration for signals_test.go to avoid flaky test result #11064

Also flaky e2e test test-executor https://github.com/argoproj/argo-workflows/actions/runs/4608012677/jobs/8143253843:

 Condition "to have running pod" met after 6s
Waiting 1m40s for workflow metadata.name=stop-terminate-dktw8
 ● stop-terminate-dktw8   Workflow 0s      
 └ ● stop-terminate-dktw8 DAG      0s      
 └ ● A                    Pod      0s      

    signals_test.go:78: timeout after 1m40s waiting for condition
Checking expectation stop-terminate-dktw8
stop-terminate-dktw8 : Running 
    signals_test.go:81: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:81
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:80
        	Error:      	[]v1alpha1.WorkflowPhase{"Failed", "Error"} does not contain "Running"
        	Test:       	TestSignalsSuite/TestTerminateBehavior
    signals_test.go:84: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:84
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:80
        	Error:      	[]v1alpha1.NodePhase{"Failed", "Error"} does not contain "Running"
        	Test:       	TestSignalsSuite/TestTerminateBehavior
=== FAIL: SignalsSuite/TestTerminateBehavior
FAIL	github.com/argoproj/argo-workflows/v3/test/e2e	535.374s

terrytangyuan · 2023-04-07T16:57:17Z

TODO

Flaky test-cli https://github.com/argoproj/argo-workflows/actions/runs/4639608340/jobs/8210973276

../../dist/argo -n argo resume @latest --node-field-selector inputs.parameters.tag.value=suspend1-tag1
exit status 12023/04/07 16:47:26 Failed to resume @latest: rpc error: code = Internal desc = currently, set only targets suspend nodes: no suspend nodes matching nodeFieldSelector: inputs.parameters.tag.value=suspend1-tag1

    cli_test.go:531: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:531
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:450
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:459
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:530
        	Error:      	Received unexpected error:
        	            	exit status 1
        	Test:       	TestCLISuite/TestNodeSuspendResume
Waiting 1m0s for workflow metadata.name=node-suspend-q4m95
 ● node-suspend-q4m95   Workflow  0s      
 └ ● node-suspend-q4m95 Steps     0s      
 └ ✔ step1              Pod       14s     
 └ ✔ [0]                StepGroup 18s     
 └ ● [1]                StepGroup 0s      
 └ ● suspend1           Suspend   0s      

Condition "suspended node" met after 0s
../../dist/argo -n argo stop @latest --node-field-selector inputs.parameters.tag.value=suspend2-tag1 --message because
exit status 1time="2023-04-07T16:47:26.732Z" level=fatal msg="rpc error: code = Internal desc = currently, set only targets suspend nodes: no suspend nodes matching nodeFieldSelector: inputs.parameters.tag.value=suspend2-tag1"

    cli_test.go:539: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:539
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:450
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:459
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:538
        	Error:      	Received unexpected error:
        	            	exit status 1
        	Test:       	TestCLISuite/TestNodeSuspendResume
Waiting 1m0s for workflow metadata.name=node-suspend-q4m95
 ● node-suspend-q4m95   Workflow  0s      
 └ ✔ step1              Pod       14s     
 └ ✔ [0]                StepGroup 18s     
 └ ● node-suspend-q4m95 Steps     0s      
 └ ● suspend1           Suspend   0s      
 └ ● [1]                StepGroup 0s      

    cli_test.go:543: timeout after 1m0s waiting for condition
Checking expectation node-suspend-q4m95
node-suspend-q4m95 : Running 
    cli_test.go:546: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:546
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:545
        	Error:      	Expect "" to match "child 'node-suspend-.*' failed"
        	Test:       	TestCLISuite/TestNodeSuspendResume

terrytangyuan · 2023-04-11T15:14:38Z

TODO

test-cli: https://github.com/argoproj/argo-workflows/actions/runs/4663229967/jobs/8254446014

../../dist/argo -n argo retry retry-with-recreated-pvc
exit status 1time="2023-04-11T02:56:07.436Z" level=fatal msg="rpc error: code = InvalidArgument desc = workflow must be Failed/Error to retry"

    cli_test.go:901: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:901
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:265
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:900
        	Error:      	Received unexpected error:
        	            	exit status 1
        	Test:       	TestCLISuite/TestWorkflowRetryWithRecreatedPVC
        	Messages:   	time="2023-04-11T02:56:07.436Z" level=fatal msg="rpc error: code = InvalidArgument desc = workflow must be Failed/Error to retry"
Waiting 1m0s for workflow metadata.name=retry-with-recreated-pvc
 ● retry-with-recreated-pvc   Workflow  0s      
 └ ● retry-with-recreated-pvc Steps     0s      
 └ ● [0]                      StepGroup 0s      
 └ ◷ generate                 Pod       0s      

    cli_test.go:907: timeout after 1m0s waiting for condition
Checking expectation retry-with-recreated-pvc
retry-with-recreated-pvc : Running 
    suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference
        goroutine 2333 [running]:
        runtime/debug.Stack()
        	/opt/hostedtoolcache/go/1.19.7/x64/src/runtime/debug/stack.go:24 +0x65
        github.com/stretchr/testify/suite.failOnPanic(0xc000901520, {0x19d8620, 0x2cc8ab0})
        	/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:87 +0x3b
        github.com/stretchr/testify/suite.Run.func1.1()
        	/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:183 +0x252
        panic({0x19d8620, 0x2cc8ab0})
        	/opt/hostedtoolcache/go/1.19.7/x64/src/runtime/panic.go:884 +0x212
        github.com/argoproj/argo-workflows/v3/test/e2e.(*CLISuite).TestWorkflowRetryWithRecreatedPVC.func2(0x1ef6880?, 0xc00011a008?, 0xc0000ce328)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:910 +0x3a
        github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).expectWorkflow(0xc00089f448, {0xc00080e3a8, 0x18}, 0x1d4d648)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68 +0x31f
        github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).ExpectWorkflow(0xc00089f448, 0xc00089f418?)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43 +0x4f
        github.com/argoproj/argo-workflows/v3/test/e2e.(*CLISuite).TestWorkflowRetryWithRecreatedPVC(0x0?)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:909 +0x508
        reflect.Value.call({0xc0003dec60?, 0xc00033b458?, 0xc0002b9c00?}, {0x1c3ed66, 0x4}, {0xc00089fe70, 0x1, 0x17cde05?})
        	/opt/hostedtoolcache/go/1.19.7/x64/src/reflect/value.go:584 +0x8c5
        reflect.Value.Call({0xc0003dec60?, 0xc00033b458?, 0xc00021af00?}, {0xc00089fe70?, 0x7fa4e4912258?, 0xd0?})
        	/opt/hostedtoolcache/go/1.19.7/x64/src/reflect/value.go:368 +0xbc
        github.com/stretchr/testify/suite.Run.func1(0xc000901520)
        	/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:197 +0x4b6
        testing.tRunner(0xc000901520, 0xc0004bce10)
        	/opt/hostedtoolcache/go/1.19.7/x64/src/testing/testing.go:1446 +0x10b
        created by testing.(*T).Run
        	/opt/hostedtoolcache/go/1.19.7/x64/src/testing/testing.go:1493 +0x35f

terrytangyuan · 2023-04-11T17:22:57Z

Hooks tests are failing: https://github.com/argoproj/argo-workflows/actions/runs/4670044854/jobs/8269390253?pr=10879

=== PASS: HooksSuite/TestTemplateLevelHooksStepSuccessVersion
    suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference
        goroutine 627 [running]:
        runtime/debug.Stack()
        	/opt/hostedtoolcache/go/1.19.8/x64/src/runtime/debug/stack.go:24 +0x65
        github.com/stretchr/testify/suite.failOnPanic(0xc000701040, {0x1a5f9a0, 0x2dfeab0})
        	/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:87 +0x3b
        github.com/stretchr/testify/suite.Run.func1.1()
        	/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:183 +0x252
        panic({0x1a5f9a0, 0x2dfeab0})
        	/opt/hostedtoolcache/go/1.19.8/x64/src/runtime/panic.go:884 +0x212
        github.com/argoproj/argo-workflows/v3/test/e2e.(*HooksSuite).TestTemplateLevelHooksStepSuccessVersion.func9(0x1f8f620?, 0xc00011a008?, 0xc0004b4f40?)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:168 +0x19
        github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).ExpectWorkflowNode.func1(0x1f8f620?, 0xc0007a8da0, 0xc0004b54b0?)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:110 +0x4c9
        github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).expectWorkflow(0xc0004b5580, {0xc0007fe3e0, 0x1f}, 0xc0004b5520)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68 +0x31f
        github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).ExpectWorkflowNode(0xc000347580?, 0xc000347570?, 0x1?)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:85 +0x51
        github.com/argoproj/argo-workflows/v3/test/e2e.(*HooksSuite).TestTemplateLevelHooksStepSuccessVersion(0x0?)
        	/home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:165 +0x325
        reflect.Value.call({0xc0000a6fc0?, 0xc00011beb0?, 0x46bdf9?}, {0x1cd5aca, 0x4}, {0xc000347e70, 0x1, 0xc000666580?})
        	/opt/hostedtoolcache/go/1.19.8/x64/src/reflect/value.go:584 +0x8c5
        reflect.Value.Call({0xc0000a6fc0?, 0xc00011beb0?, 0xc0004f6d00?}, {0xc000347e70?, 0x7f1cd42bdd00?, 0xd0?})
        	/opt/hostedtoolcache/go/1.19.8/x64/src/reflect/value.go:368 +0xbc
        github.com/stretchr/testify/suite.Run.func1(0xc000701040)
        	/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:197 +0x4b6
        testing.tRunner(0xc000701040, 0xc000778240)
        	/opt/hostedtoolcache/go/1.19.8/x64/src/testing/testing.go:1446 +0x10b
        created by testing.(*T).Run
        	/opt/hostedtoolcache/go/1.19.8/x64/src/testing/testing.go:1493 +0x35f
=== SLOW TEST:  HooksSuite/TestTemplateLevelHooksDagFailVersion took 16s
=== SLOW TEST:  HooksSuite/TestTemplateLevelHooksDagSuccessVersion took 30s
=== SLOW TEST:  HooksSuite/TestTemplateLevelHooksStepFailVersion took 18s
=== SLOW TEST:  HooksSuite/TestTemplateLevelHooksStepSuccessVersion took 41s
=== CONT  TestHooksSuite
    e2e_suite.go:86: to learn how to diagnose failed tests: https://argoproj.github.io/argo-workflows/running-locally/#running-e2e-tests-locally
--- FAIL: TestHooksSuite (107.05s)
    --- PASS: TestHooksSuite/TestTemplateLevelHooksDagFailVersion (16.67s)
    --- PASS: TestHooksSuite/TestTemplateLevelHooksDagSuccessVersion (30.26s)
    --- PASS: TestHooksSuite/TestTemplateLevelHooksStepFailVersion (18.62s)
    --- FAIL: TestHooksSuite/TestTemplateLevelHooksStepSuccessVersion (41.50s)
FAIL
FAIL	github.com/argoproj/argo-workflows/v3/test/e2e	235.744s
FAIL

@GeunSam2 Would you like to take a look at this one? It seems pretty consistent. Another example https://github.com/argoproj/argo-workflows/actions/runs/4670440612/jobs/8270253757

GeunSam2 · 2023-04-12T03:21:48Z

Okay I'll check the reason of hooks test failing.

terrytangyuan · 2023-04-17T00:07:45Z

TODO

E2E Tests (test-cli, mysql)

../../dist/argo -n argo stop @latest --node-field-selector inputs.parameters.tag.value=suspend2-tag1 --message because
exit status 1time="2023-04-15T12:49:21.333Z" level=fatal msg="rpc error: code = Internal desc = currently, set only targets suspend nodes: no suspend nodes matching nodeFieldSelector: inputs.parameters.tag.value=suspend2-tag1"

    cli_test.go:539: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:539
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:450
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:459
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:538
        	Error:      	Received unexpected error:
        	            	exit status 1
        	Test:       	TestCLISuite/TestNodeSuspendResume
Waiting 1m0s for workflow metadata.name=node-suspend-l88wm
 ● node-suspend-l88wm   Workflow  0s      
 └ ● node-suspend-l88wm Steps     0s      
 └ ✔ step1              Pod       6s      
 └ ✔ [0]                StepGroup 13s     
 └ ● suspend1           Suspend   0s      
 └ ● [1]                StepGroup 0s      

    cli_test.go:543: timeout after 1m0s waiting for condition
Checking expectation node-suspend-l88wm
node-suspend-l88wm : Running 
    cli_test.go:546: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:546
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:545
        	Error:      	Expect "" to match "child 'node-suspend-.*' failed"
        	Test:       	TestCLISuite/TestNodeSuspendResume

terrytangyuan · 2023-05-08T20:06:32Z

Fixed in chore: Fix flakey test in workflowpod_test.go #11056

--- FAIL: Test_createSecretVolumesFromArtifactLocations_SSECUsed (0.01s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x118 pc=0x1c983f3]

goroutine 22671 [running]:
testing.tRunner.func1.2({0x1ee10a0, 0x356a460})
	/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1526 +0x24e
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1529 +0x39f
panic({0x1ee10a0, 0x356a460})
	/opt/hostedtoolcache/go/1.20.3/x64/src/runtime/panic.go:884 +0x213
github.com/argoproj/argo-workflows/v3/workflow/controller.Test_createSecretVolumesFromArtifactLocations_SSECUsed(0xc002a8c640?)
	/home/runner/work/argo-workflows/argo-workflows/workflow/controller/workflowpod_test.go:1250 +0x773
testing.tRunner(0xc001d3d860, 0x231d1f0)
	/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
	/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1629 +0x3ea
FAIL	github.com/argoproj/argo-workflows/v3/workflow/controller	30.340s

terrytangyuan · 2023-05-09T20:21:15Z

chore: Fix flaky test in functional_test.go #11060

     functional_test.go:736: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/functional_test.go:736
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/functional_test.go:735
        	Error:      	Not equal: 
        	            	expected: "Failed"
        	            	actual  : "Running"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.WorkflowPhase) (len=6) "Failed"
        	            	+(v1alpha1.WorkflowPhase) (len=7) "Running"
        	            	 
        	Test:       	TestFunctionalSuite/TestParametrizableAds

terrytangyuan · 2023-07-11T23:06:52Z

test: Fix flaky test TestWorkflowLevelHooksWaitForTriggeredHook #11346

hooks_test.go:419: 
[1000](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1001)
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:419
[1001](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1002)
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
[1002](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1003)
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
[1003](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1004)
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:417
[1004](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1005)
        	Error:      	Not equal: 
[1005](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1006)
        	            	expected: "1/1"
[1006](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1007)
        	            	actual  : "2/2"
[1007](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1008)
        	            	
[1008](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1009)
        	            	Diff:
[1009](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1010)
        	            	--- Expected
[1010](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1011)
        	            	+++ Actual
[1011](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1012)
        	            	@@ -1,2 +1,2 @@
[1012](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1013)
        	            	-(v1alpha1.Progress) (len=3) "1/1"
[1013](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1014)
        	            	+(v1alpha1.Progress) (len=3) "2/2"
[1014](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1015)
        	            	 
[1015](https://github.com/argoproj/argo-workflows/actions/runs/5524469870/jobs/10077534051?pr=11338#step:16:1016)
        	Test:       	TestHooksSuite/TestTemplateLevelHooksWaitForTriggeredHook

terrytangyuan · 2023-07-15T04:37:49Z

test: Fix flakey TestLogProblems #11378

    cli_test.go:360: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:360
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:265
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:357
        	Error:      	"[log-problems-n4r8n-report-4285119875: one log-problems-n4r8n-report-3816278739: three log-problems-n4r8n-report-4205137589: four log-problems-n4r8n-report-803032099: five]" should have 5 item(s), but has 4
        	Test:       	TestCLISuite/TestLogProblems

terrytangyuan · 2023-07-18T16:52:02Z

Another one: #11384

terrytangyuan · 2023-07-20T15:27:35Z

Hooks tests are very flaky. Disabled them for now. Need to investigate potential bugs:

test: Comment out a flaky assertion in TestTemplateLevelHooksWaitForTriggeredHook #11406
test: Comment out a flaky test in TestTemplateLevelHooksStepSuccessVersion #11384

cc @toyamagu-2021 Would you like to help us debug these since you added these tests? (after you wrap up with the UI issues)

agilgur5 · 2023-09-21T02:46:22Z

Adding TestTemplateLevelHooksDagSuccessVersion from #10307 (comment) to the list. It was mentioned above too, but only 1/2 failing tests in that comment were fixed. That one may actually be a bug (as it's getting a nil pointer, not just a failed test), not just a flake, not sure.

Also this is technically a duplicate issue of #9027. They've got different flakes listed in each, but could consolidate into one issue

isubasinghe · 2024-05-02T06:27:47Z

I think we should really prevent these flaky tests being merged in the first place. @terrytangyuan and @agilgur5 what are your opinions on running the test suite in parallel 10 (or so) times and only allowing merging when for all runs the tests passed? If we can launch the jobs in parallel, we shouldn't suffer any wait time increases.

We probably would need to pay for the extra compute, but I suspect it'd be cheaper than the person hours that go into dealing with flakey tests.

agilgur5 · 2024-05-02T16:24:06Z

If we can launch the jobs in parallel, we shouldn't suffer any wait time increases.

Ostensibly yes, but the average wait time would increase since some jobs queue longer than others and some wait on network longer etc. This would probably put us over the limit of parallel jobs more frequently, causing more queueing as well

@terrytangyuan and @agilgur5 what are your opinions on running the test suite in parallel 10 (or so) times and only allowing merging when for all runs the tests passed?

I don't think this would actually help solve the problem. We're taking somewhat inaccurate flakey tests usually caused by race conditions and taking an even more inaccurate approach to it of "run all tests more times".

Most PRs don't even change the tests much, if at all, but they will fail more often with a change like this.

but I suspect it'd be cheaper than the person hours that go into dealing with flakey tests.

which would cause a lot of hours of investigation or confusion due to the existing flakes that were not caused by new code. that's the current biggest issue, and this would increase that.

I think we should be more precise in our approach.

So if we wanted to take an approach like this I would recommend one of:

only running the new tests from a PR several times -- EDIT: this is not quite correct, see below
run flake detection on a schedule, e.g. nightly or weekly
- we'd also want to run with go test -race

isubasinghe · 2024-05-02T23:15:00Z

Ostensibly yes, but the average wait time would increase since some jobs queue longer than others and some wait on network longer etc. This would probably put us over the limit of parallel jobs more frequently, causing more queueing as well

I presume we are paying more capacity here, but I can't see the time increasing by that much, sure some pipelines will take a bit longer but thats fine as long as the wait times generally are similar to what we have now.

I don't think this would actually help solve the problem. We're taking somewhat inaccurate flakey tests usually caused by race conditions and taking an even more inaccurate approach to it of "run all tests more times".

I see where you are coming from, I kind of elided the fact that when we implement this, we should have no more flakey tests.

Most PRs don't even change the tests much, if at all, but they will fail more often with a change like this.

only running the new tests from a PR several times

This effectively what I am saying I suppose, but to be more precise my suggestion is
a) fix all current flakey tests and do not accept anymore PRs that introduce tests (delay feature PRs as well)
b) then run tests from each new PRs several times, perhaps it is enough to only do this when new tests are introduced?

Some kind of flake test detection would be nice to have as well.

agilgur5 · 2024-05-04T16:29:13Z

I presume we are paying more capacity here

as long as the wait times generally are similar to what we have now.

As far as I know, we're not currently paying anything and are on the free plan. There are concurrency limits that apply per plan (and I believe they apply to the entire GH org, not per repo). If we run 140+ (10 * (13 E2Es + 1 unit tests)) more jobs per run, we will almost certainly hit that limit, which will cause queueing, i.e. some parallel jobs will end up running sequentially, which will definitely increase wait times. It also may increase wait times across the argoproj GH org.

I kind of elided the fact that when we implement this, we should have no more flakey tests.

Tall ask -- will this ever be true? 😅

perhaps it is enough to only do this when new tests are introduced?

I wrote to only run the new tests themselves multiple times. But actually, rethinking this, neither of these would be correct; a source code change can cause a test flake in an existing test. E.g. a new unhandled race was introduced. That exact scenario has happened multiple times already

I'm still thinking a nightly or weekly job would make more sense than on each PR.

…er. Fixes argoproj#10807 While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test that's been failing periodically for awhile now, I noticed this in the controller logs ([example](https://github.com/argoproj/argo-workflows/actions/runs/11221357877/job/31191811077)): ``` controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics" server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4 server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600 server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746" controller: I1007 18:22:14.800947 25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers" controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics" controller: panic: listen tcp :9090: bind: address already in use controller: controller: goroutine 37 [running]: controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2() controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36 controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c 2024/10/07 18:22:14 controller: process exited 25045: exit status 2 controller: exit status 2 2024/10/07 18:22:14 controller: backing off 4s ``` I believe this is a race condition introduced in argoproj#11295. Here's the sequence of events that trigger this: 1. Controller starts 2. Dummy metrics server started on port 9090 3. Leader election takes place and controller starts leading 4. Context for dummy metrics server cancelled 5. Metrics server shuts down 6. Prometheus metrics server started on 9090 The problems is steps 5-6 can happen out-of-order, because the shutdown happens after the contxt is cancelled. Per the docs, "a CancelFunc does not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc). The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There's many ways of doing that, and this uses a `WaitGroup`, as that's the simplest approach I could think of. Signed-off-by: Mason Malone <[email protected]>

Partial fix for argoproj#10807. Builds are occasionally failing while pulling images from Docker Hub due to rate limiting, e.g. https://github.com/argoproj/argo-workflows/actions/runs/11564257560/job/32189242898: > Oct 28 23:30:52 fv-az802-461 k3s[2185]: E1028 23:30:52.151698 2185 kuberuntime_image.go:53] "Failed to pull image" err="Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" image="minio/minio:RELEASE.2022-11-17T23-20-09Z" As explained in the error, logging in increases the limit. We're already doing this in the "Release" workflow, so this copies that over. Signed-off-by: Mason Malone <[email protected]>

MasonM · 2024-11-01T15:13:48Z

I noticed E2E test builds are occasionally failing while pulling images from Docker Hub due to rate limiting. Example:

Oct 28 23:30:52 fv-az802-461 k3s[2185]: E1028 23:30:52.151698 2185 kuberuntime_image.go:53] "Failed to pull image" err="Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" image="minio/minio:RELEASE.2022-11-17T23-20-09Z"

I entered #13830 to try and fix that by logging in using the same credentials used in the "Release" workflow, but that doesn't work due to security restrictions. It'd be theoretically possible to use docker-cache to cache these images, but that's a lot of complexity.

Using self-hosted runners would definitely solve this, and as discussed in #13767 (comment), we're eligible for the CNCF self-hosted runners. However, it seems the ticket to add us is stalled.

agilgur5 · 2024-11-01T16:52:03Z

It's best to have CI be as generic as possible; if the whole system relies on CNCF self-hosted runners, then forks won't be able to run CI at all. So workarounds may very well be worthwhile, even a generic backoff on rate limits

terrytangyuan added the type/bug label Apr 3, 2023

terrytangyuan changed the title ~~Flaky unit tests~~ Flaky tests Apr 4, 2023

sarabala1979 added the P3 Low priority label Apr 6, 2023

This was referenced Apr 13, 2023

Flakey test about lifecycle hooks (sub issue from #10807) #10897

Closed

fix: Flakey test about lifecycle hooks. Fixes #10897 #10898

Merged

terrytangyuan mentioned this issue May 8, 2023

chore: Fix flakey test in workflowpod_test.go #11056

Merged

This was referenced May 9, 2023

chore: Fix flaky test in functional_test.go #11060

Merged

chore: Bump killDuration for signals_test.go to avoid flaky test result #11064

Merged

terrytangyuan mentioned this issue May 19, 2023

fix: Disable unreliable test #11105

Merged

This comment was marked as resolved.

Sign in to view

stale bot added the problem/stale This has not had a response in some time label Jun 18, 2023

This comment was marked as resolved.

Sign in to view

stale bot removed the problem/stale This has not had a response in some time label Jun 20, 2023

terrytangyuan mentioned this issue Jul 12, 2023

test: Fix flaky test TestWorkflowLevelHooksWaitForTriggeredHook #11346

Merged

terrytangyuan mentioned this issue Jul 17, 2023

test: Fix flakey TestLogProblems #11378

Merged

terrytangyuan reopened this Sep 19, 2023

agilgur5 added the type/tech-debt label Sep 21, 2023

agilgur5 mentioned this issue Sep 21, 2023

Flakey tests preventing merge to master #10529

Closed

This was referenced Sep 24, 2023

ci: temporal workaround for flakiness of TestStopBehavior #11881

Merged

ci: increase sleep time for very flakey hook test #11921

Closed

fix: increase sleep time for very flakey hook test #11922

Closed

toyamagu-2021 mentioned this issue Oct 7, 2023

ci: E2E_ENV_FACTOR to 2 because TestParallel is very flaky #11963

Merged

terrytangyuan closed this as completed in #11963 Oct 10, 2023

agilgur5 reopened this Oct 10, 2023

toyamagu-2021 mentioned this issue Oct 15, 2023

ci: remove flaky assertion from hook e2e test #12007

Merged

agilgur5 mentioned this issue Jan 9, 2024

fix: improve resource duration accuracy. Fixes #12468 #12492

Merged

agilgur5 changed the title ~~Flaky tests~~ Flakey tests Mar 24, 2024

agilgur5 mentioned this issue Mar 30, 2024

CI: Add GH Action for /retest comments to re-run failed jobs #12864

Closed

This was referenced Sep 21, 2024

3.6.0-rc1: Controller panics due to metrics round-tripper data race #13637

Closed

Run Go tests nightly with race detector enabled #13661

Open

MasonM mentioned this issue Oct 9, 2024

fix(controller): handle race when starting metrics server. Fixes #10807 #13731

Merged

agilgur5 closed this as completed in 8100f99 Oct 10, 2024

agilgur5 reopened this Oct 10, 2024

MasonM mentioned this issue Oct 16, 2024

ci: increase timeout for e2e-tests #13769

Merged

MasonM mentioned this issue Oct 29, 2024

ci: login to Docker Hub to avoid rate limiting #13830

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flakey tests #10807

Flakey tests #10807

terrytangyuan commented Apr 3, 2023

tico24 commented Apr 4, 2023 •

edited

Loading

terrytangyuan commented Apr 4, 2023

tico24 commented Apr 4, 2023

terrytangyuan commented Apr 4, 2023 •

edited

Loading

terrytangyuan commented Apr 7, 2023 •

edited

Loading

terrytangyuan commented Apr 11, 2023 •

edited

Loading

terrytangyuan commented Apr 11, 2023 •

edited

Loading

GeunSam2 commented Apr 12, 2023

terrytangyuan commented Apr 17, 2023 •

edited

Loading

terrytangyuan commented May 8, 2023 •

edited

Loading

terrytangyuan commented May 9, 2023 •

edited

Loading

This comment was marked as resolved.

This comment was marked as resolved.

terrytangyuan commented Jul 11, 2023 •

edited

Loading

terrytangyuan commented Jul 15, 2023 •

edited

Loading

terrytangyuan commented Jul 18, 2023 •

edited

Loading

terrytangyuan commented Jul 20, 2023

agilgur5 commented Sep 21, 2023 •

edited

Loading

isubasinghe commented May 2, 2024

agilgur5 commented May 2, 2024 •

edited

Loading

isubasinghe commented May 2, 2024

agilgur5 commented May 4, 2024 •

edited

Loading

MasonM commented Nov 1, 2024

agilgur5 commented Nov 1, 2024

Flakey tests #10807

Flakey tests #10807

Comments

terrytangyuan commented Apr 3, 2023

Pre-requisites

What happened/what you expected to happen?

Version

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Logs from the workflow controller

Logs from in your workflow's wait container

tico24 commented Apr 4, 2023 • edited Loading

terrytangyuan commented Apr 4, 2023

tico24 commented Apr 4, 2023

terrytangyuan commented Apr 4, 2023 • edited Loading

terrytangyuan commented Apr 7, 2023 • edited Loading

terrytangyuan commented Apr 11, 2023 • edited Loading

terrytangyuan commented Apr 11, 2023 • edited Loading

GeunSam2 commented Apr 12, 2023

terrytangyuan commented Apr 17, 2023 • edited Loading

terrytangyuan commented May 8, 2023 • edited Loading

terrytangyuan commented May 9, 2023 • edited Loading

This comment was marked as resolved.

This comment was marked as resolved.

terrytangyuan commented Jul 11, 2023 • edited Loading

terrytangyuan commented Jul 15, 2023 • edited Loading

terrytangyuan commented Jul 18, 2023 • edited Loading

terrytangyuan commented Jul 20, 2023

agilgur5 commented Sep 21, 2023 • edited Loading

isubasinghe commented May 2, 2024

agilgur5 commented May 2, 2024 • edited Loading

isubasinghe commented May 2, 2024

agilgur5 commented May 4, 2024 • edited Loading

MasonM commented Nov 1, 2024

agilgur5 commented Nov 1, 2024

tico24 commented Apr 4, 2023 •

edited

Loading

terrytangyuan commented Apr 4, 2023 •

edited

Loading

terrytangyuan commented Apr 7, 2023 •

edited

Loading

terrytangyuan commented Apr 11, 2023 •

edited

Loading

terrytangyuan commented Apr 11, 2023 •

edited

Loading

terrytangyuan commented Apr 17, 2023 •

edited

Loading

terrytangyuan commented May 8, 2023 •

edited

Loading

terrytangyuan commented May 9, 2023 •

edited

Loading

terrytangyuan commented Jul 11, 2023 •

edited

Loading

terrytangyuan commented Jul 15, 2023 •

edited

Loading

terrytangyuan commented Jul 18, 2023 •

edited

Loading

agilgur5 commented Sep 21, 2023 •

edited

Loading

agilgur5 commented May 2, 2024 •

edited

Loading

agilgur5 commented May 4, 2024 •

edited

Loading