connectivity test: check for deleted cilium agent pod in health probe #2146
Currently, the health probe connectivity test keeps failing forever if a Cilium Agent Pod that existed when the connectivity tests started no longer exists.
The reason is that the health probe uses the list of Cilium Agent Pods that was fetched at the beginning of the connectivity test run.
Therefore, this commit adds a check for whether the Pod still exists. If it does not, the health probe check fails.
The underlying cause is often that the K8s node the Pod ran on has been deleted since the tests started.
Example of an infinite health probe attempt (until GitHub action timeout) in cilium/cilium due to Cilium Agent Pod deletion (Node has been deleted): https://github.com/cilium/cilium/actions/runs/7060997016/job/19221753163
Related PR (connectivity test timeout): #2145
Alternative
An alternative would be to not rely on the Cilium Agent Pods that were gathered at the beginning of the test run, and instead re-fetch them at the beginning of the Health Probe test scenario.
This might be something for a follow-up PR. In that case, I think it would be worth keeping the part that informs the user that a Cilium Agent Pod has been deleted, maybe in a separate test that only checks for this.
It boils down to the question of what the health probe test should actually check for.