Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connectivity test: check for deleted cilium agent pod in health probe #2146

Merged

Conversation

mhofstetter
Copy link
Member

@mhofstetter mhofstetter commented Dec 4, 2023

Currently, the health probe connectivity test fails forever if an Cilium Agent Pod that existed when starting the connectivity tests no longer exists.

The reason is that the health probes uses the list of Cilium Agent pods that got fetched at the beginning of the connectivity test run.

Therefore, this commit adds a check whether the Pod still exists. If not, the health probe check fails.

The underlying reason is often that a underlying K8s node has been deleted in the meantime (since starting the tests).

Example of an infinite health probe attempt (until GitHub action timeout) in cilium/cilium due to Cilium Agent Pod deletion (Node has been deleted): https://github.com/cilium/cilium/actions/runs/7060997016/job/19221753163

Related PR (connectivity test timeout): #2145

Alternative

An alternative would be to not rely on the Cilium Agent Pods that have been gathered at the beginning of the test run - and instead re-fetch them at the beginning of the Health Probe test scenario.

This might be something for a follow up PR. In this case i think it would be worth to keep the part about informing the user that a Cilium Agent Pod has been deleted. Maybe in a separate test that only checks for this.

Kind of boils down to the question what the health probe test should actually check for.

Currently, the health probe connectivity test fails forever if an
Cilium Agent Pod that existed when starting the connectivity tests no
longer exists.

The reason is that the health probes uses the list of Cilium Agent pods
that got fetched at the beginning of the connectivity test run.

Therefore, this commit adds a check whether the Pod still exists. If not,
the health probe check fails.

The underlying reason is often that a underlying K8s node has been deleted
in the meantime (since starting the tests).

Signed-off-by: Marco Hofstetter <[email protected]>
@mhofstetter mhofstetter marked this pull request as ready for review December 4, 2023 11:23
@mhofstetter mhofstetter requested a review from a team as a code owner December 4, 2023 11:23
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 5, 2023
@tklauser tklauser merged commit 04f8372 into cilium:main Dec 5, 2023
20 checks passed
@mhofstetter mhofstetter deleted the pr/mhofstetter/ct-healthprobe-check-pod branch December 5, 2023 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants