-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data race between clusterctl move and cluster controller #11812
Comments
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
I think there is one point of discussion: Should the cluster controller add a finalizer to a paused Cluster object? (Intuitively, it seems that it should not add the finalizer, but perhaps there is some good reason I am not seeing.) |
On a separate note, @faiq pointed out that clusterctl should delete the resource first, and only then remove the finalizers. That would prevent any data race, since finalizers cannot be added after the deletion timestamp is present. |
Seems like this might be a possible cause of #11809 ? |
I think you're right. And it's been flaking since 10/18/2024, just a few days after the above PRs merged. |
I would prefer to first delete then remove finalizers.
|
Note: with v1beta2 conditions: clusterctl could wait for the objects (which have the condition IsPaused v1beta2 condition) that the controller recognized them being paused. But the above still seems to be required, and this one not really is. |
Yeah waiting for paused doesn't help. The controller intentionally always enforces the finalizer if deletionTimestamp is not set. Independent of if the Cluster is paused or not |
Thanks for the insight! I think we agree that the data race is by design, and that clusterctl must delete first, and only then remove finalizers. Once #11814 merges (and we backport to 1.9.x), I think we can close this issue as completed. |
Sounds good! |
What steps did you take and what happened?
Clusterctl move deletes the Cluster objects on the source cluster.
For each Cluster object clusterctl move
cluster-api/cmd/clusterctl/client/cluster/mover.go
Lines 1233 to 1243 in a7cfa99
However, between (1) and (2), the cluster controller may add back its finalizer!
This is due to a combination of changes, namely
After (1), we add the finalizer before we check if the Cluster is paused:
cluster-api/internal/controllers/cluster/cluster_controller.go
Lines 153 to 160 in 2974524
After (2), we no longer skip Reconcile if the Cluster is paused:
cluster-api/internal/controllers/cluster/cluster_controller.go
Lines 101 to 121 in 2974524
What did you expect to happen?
I expected
clusterctl move
to successfully delete the Cluster object on the source cluster.Cluster API version
1.9.3
Kubernetes version
1.31.4
Anything else you would like to add?
I think we may be seeing this in our e2e tests. I'll look through other reports and link any I think might be related.
Thanks also to @dkoshkin for identifying the above PRs as potential sources of trouble.
Label(s) to be applied
/kind bug
/area clusterctl
/area cluster
The text was updated successfully, but these errors were encountered: