Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race between clusterctl move and cluster controller #11812

Open
dlipovetsky opened this issue Feb 6, 2025 · 10 comments
Open

Data race between clusterctl move and cluster controller #11812

dlipovetsky opened this issue Feb 6, 2025 · 10 comments
Labels
area/cluster Issues or PRs related to clusters area/clusterctl Issues or PRs related to clusterctl kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@dlipovetsky
Copy link
Contributor

dlipovetsky commented Feb 6, 2025

What steps did you take and what happened?

Clusterctl move deletes the Cluster objects on the source cluster.

For each Cluster object clusterctl move

  1. Pauses it
  2. Removes its finalizers
  3. Deletes it

if len(sourceObj.GetFinalizers()) > 0 {
if err := cFrom.Patch(ctx, sourceObj, removeFinalizersPatch); err != nil {
return errors.Wrapf(err, "error removing finalizers from %q %s/%s",
sourceObj.GroupVersionKind(), sourceObj.GetNamespace(), sourceObj.GetName())
}
}
if err := cFrom.Delete(ctx, sourceObj); err != nil {
return errors.Wrapf(err, "error deleting %q %s/%s",
sourceObj.GroupVersionKind(), sourceObj.GetNamespace(), sourceObj.GetName())
}

However, between (1) and (2), the cluster controller may add back its finalizer!

This is due to a combination of changes, namely

  1. Handling finalizers early in the Reconcile 🌱 Handle finalizers early in Reconciles #11286
  2. Handling pause differently 🌱 v1beta2 conditions: add function for setting the Paused condition #11284

After (1), we add the finalizer before we check if the Cluster is paused:

// Add finalizer first if not set to avoid the race condition between init and delete.
if finalizerAdded, err := finalizers.EnsureFinalizer(ctx, r.Client, cluster, clusterv1.ClusterFinalizer); err != nil || finalizerAdded {
return ctrl.Result{}, err
}
if isPaused, conditionChanged, err := paused.EnsurePausedCondition(ctx, r.Client, cluster, cluster); err != nil || isPaused || conditionChanged {
return ctrl.Result{}, err
}

After (2), we no longer skip Reconcile if the Cluster is paused:

WatchesRawSource(r.ClusterCache.GetClusterSource("cluster", func(_ context.Context, o client.Object) []ctrl.Request {
return []ctrl.Request{{NamespacedName: client.ObjectKeyFromObject(o)}}
}, clustercache.WatchForProbeFailure(r.RemoteConnectionGracePeriod))).
Watches(
&clusterv1.Machine{},
handler.EnqueueRequestsFromMapFunc(r.controlPlaneMachineToCluster),
builder.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
).
Watches(
&clusterv1.MachineDeployment{},
handler.EnqueueRequestsFromMapFunc(r.machineDeploymentToCluster),
builder.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
).
Watches(
&expv1.MachinePool{},
handler.EnqueueRequestsFromMapFunc(r.machinePoolToCluster),
builder.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
).
WithOptions(options).
WithEventFilter(predicates.ResourceHasFilterLabel(mgr.GetScheme(), predicateLog, r.WatchFilterValue)).
Build(r)

What did you expect to happen?

I expected clusterctl move to successfully delete the Cluster object on the source cluster.

Cluster API version

1.9.3

Kubernetes version

1.31.4

Anything else you would like to add?

I think we may be seeing this in our e2e tests. I'll look through other reports and link any I think might be related.

Thanks also to @dkoshkin for identifying the above PRs as potential sources of trouble.

Label(s) to be applied

/kind bug
/area clusterctl
/area cluster

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/clusterctl Issues or PRs related to clusterctl area/cluster Issues or PRs related to clusters needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 6, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dlipovetsky
Copy link
Contributor Author

dlipovetsky commented Feb 6, 2025

I think there is one point of discussion: Should the cluster controller add a finalizer to a paused Cluster object?

(Intuitively, it seems that it should not add the finalizer, but perhaps there is some good reason I am not seeing.)

@dlipovetsky
Copy link
Contributor Author

On a separate note, @faiq pointed out that clusterctl should delete the resource first, and only then remove the finalizers. That would prevent any data race, since finalizers cannot be added after the deletion timestamp is present.

@cprivitere
Copy link
Member

Seems like this might be a possible cause of #11809 ?

@dlipovetsky
Copy link
Contributor Author

Seems like this might be a possible cause of #11809 ?

I think you're right. And it's been flaking since 10/18/2024, just a few days after the above PRs merged.

@sbueringer
Copy link
Member

I would prefer to first delete then remove finalizers.

  1. Not adding the finalizer if an object is paused increases the chances of resource leaks in general
  2. External controllers / users might not care about paused if they add their own finalizers

@chrischdi
Copy link
Member

chrischdi commented Feb 7, 2025

Note: with v1beta2 conditions: clusterctl could wait for the objects (which have the condition IsPaused v1beta2 condition) that the controller recognized them being paused. But the above still seems to be required, and this one not really is.

@sbueringer
Copy link
Member

sbueringer commented Feb 7, 2025

Yeah waiting for paused doesn't help. The controller intentionally always enforces the finalizer if deletionTimestamp is not set. Independent of if the Cluster is paused or not

@dlipovetsky
Copy link
Contributor Author

dlipovetsky commented Feb 7, 2025

Thanks for the insight!

I think we agree that the data race is by design, and that clusterctl must delete first, and only then remove finalizers.

Once #11814 merges (and we backport to 1.9.x), I think we can close this issue as completed.

@sbueringer
Copy link
Member

Sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster Issues or PRs related to clusters area/clusterctl Issues or PRs related to clusterctl kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Projects
None yet
Development

No branches or pull requests

5 participants