Return error for unhealthy backend discovery during subresource observeBackend calls #298

Merged
nolancon merged 7 commits into main from lc-config-status-debug on Nov 4, 2024

Conversation

nolancon
Collaborator

@nolancon nolancon commented Oct 29, 2024

Description of your changes

When we Observe a subresource client, we skip over any unhealthy backends. We do this so as not to block subsequent Create/Update/Delete calls after observation.

However, during Update we also perform this check/skip as part of lower-level observeBackend function calls. As a result, Update can silently fail to reconcile a subresource on a backend that is momentarily unhealthy.

As such, the check for the backend's healthiness has been split for each subresource client:

  • During the Observe phase, unhealthy backends are simply skipped to allow subsequent operations to continue.
  • During the Update phase (which also calls observeBackend for each subresource client at a lower level), unhealthy backends result in an error and a re-queue of the request. This is safe because re-queued requests are subject to the exponential back-off mechanism.

I have:

  • Run make reviewable to ensure this PR is ready for review.
  • Run make ceph-chainsaw to validate these changes against Ceph. This step is not always necessary. However, for changes related to S3 calls it is sensible to validate against an actual Ceph cluster. Localstack is used in our CI Chainsaw suite for convenience and there can be disparity in S3 behaviours between it and Ceph. See docs/TESTING.md for information on how to run tests against a Ceph cluster.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Existing tests - this change fixes a very specific edge case which is difficult to reproduce.

@nolancon nolancon marked this pull request as ready for review November 1, 2024 13:59
internal/controller/bucket/acl.go (outdated, resolved)
internal/controller/bucket/lifecycleconfiguration.go (outdated, resolved)
internal/controller/bucket/objectlockconfiguration.go (outdated, resolved)
internal/controller/bucket/policy.go (outdated, resolved)
internal/controller/bucket/versioningconfiguration.go (outdated, resolved)
@nolancon nolancon merged commit b8522d4 into main Nov 4, 2024
10 checks passed
@nolancon nolancon deleted the lc-config-status-debug branch November 4, 2024 12:55
2 participants