Data race between clusterctl move and cluster controller #11812

dlipovetsky · 2025-02-06T20:23:10Z

What steps did you take and what happened?

Clusterctl move deletes the Cluster objects on the source cluster.

For each Cluster object, clusterctl move:

- Pauses it
- Removes its finalizers
- Deletes it

cluster-api/cmd/clusterctl/client/cluster/mover.go

Lines 1233 to 1243 in a7cfa99

    
    if len(sourceObj.GetFinalizers()) > 0 {
        if err := cFrom.Patch(ctx, sourceObj, removeFinalizersPatch); err != nil {
            return errors.Wrapf(err, "error removing finalizers from %q %s/%s",
                sourceObj.GroupVersionKind(), sourceObj.GetNamespace(), sourceObj.GetName())
        }
    }
    if err := cFrom.Delete(ctx, sourceObj); err != nil {
        return errors.Wrapf(err, "error deleting %q %s/%s",
            sourceObj.GroupVersionKind(), sourceObj.GetNamespace(), sourceObj.GetName())
    }

However, between removing the finalizers and deleting the object, the cluster controller may add back its finalizer!

This is due to a combination of changes, namely:

1. Handling finalizers early in the Reconcile: 🌱 Handle finalizers early in Reconciles #11286
2. Handling pause differently: 🌱 v1beta2 conditions: add function for setting the Paused condition #11284

After (1), we add the finalizer before we check if the Cluster is paused:

cluster-api/internal/controllers/cluster/cluster_controller.go

Lines 153 to 160 in 2974524

    
    // Add finalizer first if not set to avoid the race condition between init and delete.
    if finalizerAdded, err := finalizers.EnsureFinalizer(ctx, r.Client, cluster, clusterv1.ClusterFinalizer); err != nil || finalizerAdded {
        return ctrl.Result{}, err
    }
    if isPaused, conditionChanged, err := paused.EnsurePausedCondition(ctx, r.Client, cluster, cluster); err != nil || isPaused || conditionChanged {
        return ctrl.Result{}, err
    }
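The effect of that ordering can be sketched with a toy model. This is a minimal sketch, not the controller's code: the `cluster` struct, the `reconcile` function, and the finalizer string are hypothetical stand-ins, and the model assumes only what is described in this thread (the finalizer is ensured first, the pause check comes second, and the finalizer is only enforced while no deletionTimestamp is set).

```go
package main

import "fmt"

// clusterFinalizer is a hypothetical stand-in for clusterv1.ClusterFinalizer.
const clusterFinalizer = "example.test/cluster-finalizer"

// cluster is a toy stand-in for the Cluster object, reduced to the
// fields that matter for the ordering of checks.
type cluster struct {
	paused     bool
	deleting   bool // true once a deletionTimestamp is set
	finalizers []string
}

func hasFinalizer(c *cluster, f string) bool {
	for _, x := range c.finalizers {
		if x == f {
			return true
		}
	}
	return false
}

// reconcile mirrors only the order of the two checks: the finalizer is
// ensured before the paused state is looked at.
func reconcile(c *cluster) string {
	if !c.deleting && !hasFinalizer(c, clusterFinalizer) {
		c.finalizers = append(c.finalizers, clusterFinalizer)
		return "finalizer added"
	}
	if c.paused {
		return "paused"
	}
	return "reconciled"
}

func main() {
	// A paused Cluster whose finalizers were just removed.
	c := &cluster{paused: true}
	fmt.Println(reconcile(c)) // finalizer added
	fmt.Println(reconcile(c)) // paused
}
```

In this model, pausing the Cluster does not stop the first reconcile from restoring the finalizer; only a set deletionTimestamp does.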

After (2), we no longer skip Reconcile if the Cluster is paused:

cluster-api/internal/controllers/cluster/cluster_controller.go

Lines 101 to 121 in 2974524

    
    WatchesRawSource(r.ClusterCache.GetClusterSource("cluster", func(_ context.Context, o client.Object) []ctrl.Request {
        return []ctrl.Request{{NamespacedName: client.ObjectKeyFromObject(o)}}
    }, clustercache.WatchForProbeFailure(r.RemoteConnectionGracePeriod))).
    Watches(
        &clusterv1.Machine{},
        handler.EnqueueRequestsFromMapFunc(r.controlPlaneMachineToCluster),
        builder.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
    ).
    Watches(
        &clusterv1.MachineDeployment{},
        handler.EnqueueRequestsFromMapFunc(r.machineDeploymentToCluster),
        builder.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
    ).
    Watches(
        &expv1.MachinePool{},
        handler.EnqueueRequestsFromMapFunc(r.machinePoolToCluster),
        builder.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
    ).
    WithOptions(options).
    WithEventFilter(predicates.ResourceHasFilterLabel(mgr.GetScheme(), predicateLog, r.WatchFilterValue)).
    Build(r)

What did you expect to happen?

I expected clusterctl move to successfully delete the Cluster object on the source cluster.

Cluster API version

1.9.3

Kubernetes version

1.31.4

Anything else you would like to add?

I think we may be seeing this in our e2e tests. I'll look through other reports and link any I think might be related.

Thanks also to @dkoshkin for identifying the above PRs as potential sources of trouble.

Label(s) to be applied

/kind bug
/area clusterctl
/area cluster

k8s-ci-robot · 2025-02-06T20:23:20Z

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

dlipovetsky · 2025-02-06T20:35:41Z

I think there is one point of discussion: Should the cluster controller add a finalizer to a paused Cluster object?

(Intuitively, it seems that it should not add the finalizer, but perhaps there is some good reason I am not seeing.)

dlipovetsky · 2025-02-06T20:35:57Z

On a separate note, @faiq pointed out that clusterctl should delete the resource first, and only then remove the finalizers. That would prevent any data race, since finalizers cannot be added after the deletion timestamp is present.
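To make the difference between the two orderings concrete, here is a minimal in-memory sketch. It is not clusterctl or API server code: the `object` type, its methods, and `move` are hypothetical, and the model assumes only the behavior described in this thread (a delete on an object with finalizers just marks it as deleting, new finalizers are rejected once it is deleting, and the object goes away when its finalizers are removed while deleting). The `controller` hook runs between the two steps to model the worst-case interleaving.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a toy model of an API object; it is not a Kubernetes type.
type object struct {
	finalizers []string
	deleting   bool // models a set deletionTimestamp
	gone       bool // models removal from storage
}

var errNoNewFinalizers = errors.New("no new finalizers can be added to a deleted or deleting object")

// addFinalizer models the controller re-adding its finalizer.
func (o *object) addFinalizer(f string) error {
	if o.gone || o.deleting {
		return errNoNewFinalizers
	}
	o.finalizers = append(o.finalizers, f)
	return nil
}

// removeFinalizers clears all finalizers; a deleting object is then removed.
func (o *object) removeFinalizers() {
	o.finalizers = nil
	if o.deleting {
		o.gone = true
	}
}

// delete removes the object immediately only if it has no finalizers;
// otherwise it just marks the object as deleting.
func (o *object) delete() {
	if len(o.finalizers) == 0 {
		o.gone = true
		return
	}
	o.deleting = true
}

// move runs the two steps in the chosen order, with the controller
// acting in between.
func move(o *object, deleteFirst bool, controller func(*object)) {
	if deleteFirst {
		o.delete()
		controller(o)
		o.removeFinalizers()
		return
	}
	o.removeFinalizers()
	controller(o)
	o.delete()
}

func main() {
	controller := func(o *object) { _ = o.addFinalizer("cluster") }

	old := &object{finalizers: []string{"cluster"}}
	move(old, false, controller)
	fmt.Println("remove, then delete: gone =", old.gone) // gone = false

	fixed := &object{finalizers: []string{"cluster"}}
	move(fixed, true, controller)
	fmt.Println("delete, then remove: gone =", fixed.gone) // gone = true
}
```

With the original order, the object ends up marked as deleting but still holding a finalizer, so it is never removed. With the delete-first order, the controller's attempt to re-add the finalizer is rejected and the object is removed.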

cprivitere · 2025-02-06T21:00:15Z

Seems like this might be a possible cause of #11809 ?

dlipovetsky · 2025-02-06T21:53:50Z

> Seems like this might be a possible cause of #11809 ?

I think you're right. And it's been flaking since 10/18/2024, just a few days after the above PRs merged.

sbueringer · 2025-02-07T06:28:14Z

I would prefer to first delete then remove finalizers.

- Not adding the finalizer if an object is paused increases the chances of resource leaks in general
- External controllers / users might not care about paused if they add their own finalizers

chrischdi · 2025-02-07T12:33:18Z

Note: with v1beta2 conditions, clusterctl could wait until the objects that have the v1beta2 Paused condition show that the controller has recognized them as paused. But the change above (delete first, then remove finalizers) still seems to be required, and this one is not really needed.

sbueringer · 2025-02-07T12:41:12Z

Yeah, waiting for paused doesn't help. The controller intentionally always enforces the finalizer if deletionTimestamp is not set, regardless of whether the Cluster is paused.

dlipovetsky · 2025-02-07T16:16:28Z

Thanks for the insight!

I think we agree that the data race is by design, and that clusterctl must delete first, and only then remove finalizers.

Once #11814 merges (and we backport to 1.9.x), I think we can close this issue as completed.

sbueringer · 2025-02-07T18:11:11Z

Sounds good!

faiq mentioned this issue Feb 6, 2025

🐛 fix: send delete request before removing finalizers #11814

Merged
