
Cleanup orphaned ManagedCluster resources after setting hubAcceptsClient to false #816

Open
jaswalkiranavtar opened this issue Jan 22, 2025 · 10 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@jaswalkiranavtar
Contributor

Describe the enhancement

Currently, if the spoke sets the hubAcceptsClient field to false on the ManagedCluster CR, the hub will clean up all the RBAC resources and the namespace on the hub.

https://github.com/open-cluster-management-io/ocm/blob/main/pkg/registration/hub/managedcluster/controller.go#L162

We will also update this controller to clean up the IAM resources created by the hub for this specific spoke. The only leftover resource will then be the ManagedCluster itself. We have a lot of temporary clusters that are created and destroyed every week for testing, which will lead to a lot of orphaned ManagedCluster resources.

To clean them up, we are planning to add a CronJob to the OCM hub. @qiujian16 also suggested creating a controller on the hub side instead of a CronJob. Creating this issue to get recommendations from @qiujian16 and the other maintainers.
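
As a rough sketch of the CronJob idea (not the actual implementation), the snippet below lists ManagedClusters on the hub and deletes those whose spec.hubAcceptsClient is false. It assumes the generated clientset from open-cluster-management.io/api and an in-cluster kubeconfig; any grace period or extra safety checks are left out.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"

	clusterclientset "open-cluster-management.io/api/client/cluster/clientset/versioned"
)

func main() {
	// Runs inside the hub cluster, e.g. as a CronJob pod.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("failed to load in-cluster config: %v", err)
	}
	clusterClient, err := clusterclientset.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create cluster client: %v", err)
	}

	ctx := context.Background()
	clusters, err := clusterClient.ClusterV1().ManagedClusters().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatalf("failed to list ManagedClusters: %v", err)
	}

	for _, mc := range clusters.Items {
		// Skip clusters that are still accepted or already being deleted.
		if mc.Spec.HubAcceptsClient || !mc.DeletionTimestamp.IsZero() {
			continue
		}
		log.Printf("deleting orphaned ManagedCluster %s", mc.Name)
		if err := clusterClient.ClusterV1().ManagedClusters().Delete(ctx, mc.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("failed to delete ManagedCluster %s: %v", mc.Name, err)
		}
	}
}
```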

@jaswalkiranavtar jaswalkiranavtar added the enhancement New feature or request label Jan 22, 2025
@jaswalkiranavtar jaswalkiranavtar changed the title Cleanup ManagedCluster resources which have unjoined by setting hubAccepptsClient to false Cleanup orphaned ManagedCluster resources after setting hubAcceptsClient to false Jan 22, 2025
@dosubot dosubot bot added the question Further information is requested label Jan 22, 2025
@qiujian16
Member

@zhiweiyin318 @elgnay thoughts on this?

@zhiweiyin318
Member

zhiweiyin318 commented Jan 22, 2025

We have a ResourceCleanup feature gate in the registration component on the hub. If this feature gate is enabled, the addons, manifestworks, and the RBAC resources related to the cluster will be cleaned up after the cluster is deleted.
We also defined a gc-resource-list flag with which the user can define which resources the gc controller deletes after the cluster is deleted when the feature gate is enabled; the default resources in the list are addon and manifestwork. But we have not exposed it in the ClusterManager yet.
Not sure if this can help in this case?
If we are going to add IAM resource cleanup, we could consider putting that part into the gc controller too: https://github.com/open-cluster-management-io/ocm/blob/main/pkg/registration/hub/gc/controller.go
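
For context, the gc-resource-list idea can be illustrated with a dynamic-client sketch: for each configured GroupVersionResource, delete every instance in the per-cluster namespace. This is only a sketch of the pattern, not the gc controller's actual code; the GVR list and the helper name are assumptions.

```go
package gcsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// defaultGCResources mirrors the documented defaults (addon and manifestwork);
// additional GVRs, e.g. IAM-related CRs, could be appended here.
var defaultGCResources = []schema.GroupVersionResource{
	{Group: "addon.open-cluster-management.io", Version: "v1alpha1", Resource: "managedclusteraddons"},
	{Group: "work.open-cluster-management.io", Version: "v1", Resource: "manifestworks"},
}

// cleanupClusterResources deletes every instance of the listed resources in the
// per-cluster namespace on the hub, similar in spirit to what a gc-resource-list
// driven controller would do after the cluster is deleted.
func cleanupClusterResources(ctx context.Context, dyn dynamic.Interface, clusterNamespace string) error {
	for _, gvr := range defaultGCResources {
		err := dyn.Resource(gvr).Namespace(clusterNamespace).DeleteCollection(
			ctx, metav1.DeleteOptions{}, metav1.ListOptions{})
		if err != nil {
			return err
		}
	}
	return nil
}
```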

@jaswalkiranavtar
Contributor Author

We want to trigger the unjoin/deregister from the spoke. If my understanding is correct, the spoke doesn't have permission to delete the ManagedCluster resource on the hub. So we thought the spoke could update the hubAcceptsClient flag to false, which would trigger the whole cleanup.

The reason we want to trigger it from the spoke is that we will have a long-lived hub, but many spokes can come up and go down during a week. So before we destroy a spoke, we want to update the flag to false so that all spoke-specific resources are cleaned up from the hub.
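
For illustration, that teardown step could look like the sketch below, assuming whatever process tears down the spoke has hub credentials that are allowed to update this ManagedCluster's spec (a nontrivial assumption) and using the clientset from open-cluster-management.io/api; the function name is made up.

```go
package teardown

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"

	clusterclientset "open-cluster-management.io/api/client/cluster/clientset/versioned"
)

// denyCluster flips spec.hubAcceptsClient to false on the hub before the spoke
// is destroyed, so the hub-side cleanup is triggered for this cluster.
func denyCluster(ctx context.Context, clusterClient clusterclientset.Interface, clusterName string) error {
	patch := []byte(`{"spec":{"hubAcceptsClient":false}}`)
	_, err := clusterClient.ClusterV1().ManagedClusters().Patch(
		ctx, clusterName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```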

@zhiweiyin318
Member

I think the gc controller is the place where we do resource cleanup, so we can put the resource cleanup code there.
Currently, deleting a cluster triggers resource cleanup in the gc controller; I think it's OK to also handle hubAcceptsClient=false in the gc controller.

@jaswalkiranavtar
Contributor Author

Ok thanks, will look at it.

@jaswalkiranavtar
Contributor Author

jaswalkiranavtar commented Feb 3, 2025

@zhiweiyin318 @qiujian16 I also took another look at the managedcluster controller on the hub and have a few observations.

  1. If we set the hubAcceptsClient property to false on the ManagedCluster CR, these rolebindings should be deleted. However, the managedcluster-work-rolebinding.yaml rolebinding is not getting deleted because of a finalizer. We also see that this finalizer is getting removed here.
  2. Currently the spoke-specific namespace is created on the hub when the hubAcceptsClient field is set to true. However, it is not getting deleted along with the other RBAC resources when hubAcceptsClient is set to false.
  3. Coming back to the original question, we want to delete the ManagedCluster resource and the namespace when the ManagedCluster has this condition. Please note that this condition only gets added when hubAcceptsClient is set to false on an already accepted ManagedCluster. Shall we update the GC controller to delete the namespace and the ManagedCluster resource when the above conditions are met? We can put this change behind a feature gate if needed (see the sketch after this list).
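
To make observation 3 concrete, the sketch below shows the kind of predicate such a GC change could use: only treat a cluster as eligible for cleanup when hubAcceptsClient is false and the HubAcceptedManagedCluster condition is explicitly False (i.e. the cluster was accepted before and has since been denied). This is a sketch only; the helper name is hypothetical and the exact condition reason is not checked here.

```go
package gcsketch

import (
	"k8s.io/apimachinery/pkg/api/meta"

	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// shouldGarbageCollect is a hypothetical predicate for the proposed behaviour:
// the cluster is no longer accepted, and the hub has already recorded the
// denial by setting the HubAcceptedManagedCluster condition to False. A missing
// condition means the cluster was never accepted, so it is left alone.
func shouldGarbageCollect(mc *clusterv1.ManagedCluster) bool {
	if mc.Spec.HubAcceptsClient {
		return false
	}
	return meta.IsStatusConditionFalse(mc.Status.Conditions, clusterv1.ManagedClusterConditionHubAccepted)
}
```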

@qiujian16
Member

@jaswalkiranavtar

  1. The reason we do not remove the work rolebinding is that, if it were removed, the manifestworks could no longer be deleted, since the agent would no longer have permission to remove the finalizer. So this rolebinding is not removed until all the manifestworks in the namespace are removed.
  2. Yes, the namespace is not removed. The reason is that, in many cases, there could be other resources in this namespace, and deleting them may involve creating a job or pod to do the cleanup in this ns. If the ns were in a deleting state, these cleanup jobs/pods could not be created and some resources in the ns could not be removed. Also, for managedClusterAddon, some addons use a pre-delete manifestwork to do the cleanup job, so the ns cannot be in a deleting state in that case either.
  3. It is possible, but we need a better solution for ns removal. Some things we should consider:
    1. Allow the user to specify the condition under which the ns can be removed, e.g. if some resource still exists in the ns, the ns should not be removed.
    2. When hubAcceptsClient=false, do we want to remove all manifestworks/managedclusteraddons in this namespace?
    3. A more careful removal sequence: the ns can only be deleted once there is no manifestwork/managedclusteraddon left in it.
    4. In addition to removing the ns, we might also need admission control so that, if a cluster is in a deleting state, no manifestwork/managedclusteraddon can be created in the related ns (see the sketch after this list).
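
As a rough sketch of the admission-control idea in point 4 (not the actual OCM webhook), a create-time check could reject new ManifestWorks whose target cluster is missing or terminating; the function name here is hypothetical.

```go
package admissionsketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	clusterclientset "open-cluster-management.io/api/client/cluster/clientset/versioned"
	workv1 "open-cluster-management.io/api/work/v1"
)

// denyIfClusterDeleting is the kind of check a validating webhook for
// ManifestWork creation could run: look up the ManagedCluster matching the
// work's namespace and reject the request if the cluster is gone or terminating.
func denyIfClusterDeleting(ctx context.Context, clusterClient clusterclientset.Interface, work *workv1.ManifestWork) error {
	mc, err := clusterClient.ClusterV1().ManagedClusters().Get(ctx, work.Namespace, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return fmt.Errorf("managed cluster %q does not exist, refusing to create ManifestWork", work.Namespace)
	}
	if err != nil {
		return err
	}
	if !mc.DeletionTimestamp.IsZero() {
		return fmt.Errorf("managed cluster %q is being deleted, refusing to create ManifestWork", work.Namespace)
	}
	return nil
}
```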

@jaswalkiranavtar
Contributor Author

Regarding ii: yes, I think the ResourceCleanup feature gate automatically does that for us.

I think we can create a new feature gate which, if enabled, will trigger the cleanup of the leftover ManagedCluster and namespace. The other things you are talking about, i.e. i and ii, will be taken care of by the ResourceCleanup feature gate. I will also check whether iv is already present.
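
For reference, registering such a feature gate with k8s.io/component-base/featuregate could look like the sketch below; the gate name CleanupUnacceptedClusters is made up, and how OCM actually wires its feature gates into the hub is not shown here.

```go
package featuresketch

import (
	"k8s.io/component-base/featuregate"
)

// CleanupUnacceptedClusters is a hypothetical gate name for the proposed
// behaviour: delete the leftover ManagedCluster (and possibly its namespace)
// after hubAcceptsClient has been set to false.
const CleanupUnacceptedClusters featuregate.Feature = "CleanupUnacceptedClusters"

// gates stands in for the hub's shared mutable feature gate instance.
var gates featuregate.MutableFeatureGate = featuregate.NewFeatureGate()

func init() {
	// Disabled by default and alpha, so existing users see no behaviour change.
	_ = gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		CleanupUnacceptedClusters: {Default: false, PreRelease: featuregate.Alpha},
	})
}
```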

@zhiweiyin318
Member

I think there are 2 features here:

  1. Delete the cluster ns:
    The prerequisites for deleting the cluster ns that I can think of:
    1. The cluster is being deleted (otherwise there may still be other resources being applied in the cluster ns).
    2. There is no manifestwork/managedClusterAddon left in the cluster ns (see the sketch after this list).
    3. There are no CAPI cluster resources in the cluster ns (we just added support for importing CAPI clusters).
    4. There are no other customized resources, such as Hive resources, in the cluster ns.
    5. It is not proper for OCM to delete the cluster ns if the ns was created by the user.
  2. Remove cluster resources if hubAcceptsClient=false:
    1. Remove RBAC resources (the work clusterrole and rolebinding should only be deleted once there is no manifestwork left, otherwise the manifestworks will be orphaned).
    2. Remove manifestworks/managedclusteraddons (I don't think this is the expected behaviour for all users; for example, in the GitOps case the cluster and the manifestworks/managedClusterAddons are created together at once).
    3. Remove the cluster ns (I don't think we can delete the cluster ns, because there are several prerequisites for cluster ns deletion).
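
To illustrate prerequisite 2 above (the one easily checked with OCM's own APIs), the sketch below refuses to delete the cluster namespace while any ManifestWork or ManagedClusterAddOn remains in it. It is a sketch under assumptions: the clientset import paths are the generated ones from open-cluster-management.io/api, the helper name is made up, and the CAPI, Hive, and user-created-namespace prerequisites are not covered.

```go
package gcsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	addonclientset "open-cluster-management.io/api/client/addon/clientset/versioned"
	workclientset "open-cluster-management.io/api/client/work/clientset/versioned"
)

// deleteClusterNamespaceIfEmpty deletes the per-cluster namespace only when no
// ManifestWork and no ManagedClusterAddOn is left in it; otherwise it returns
// an error so the caller can requeue and retry later.
func deleteClusterNamespaceIfEmpty(
	ctx context.Context,
	kubeClient kubernetes.Interface,
	workClient workclientset.Interface,
	addonClient addonclientset.Interface,
	clusterNamespace string,
) error {
	works, err := workClient.WorkV1().ManifestWorks(clusterNamespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	addons, err := addonClient.AddonV1alpha1().ManagedClusterAddOns(clusterNamespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	if len(works.Items) > 0 || len(addons.Items) > 0 {
		return fmt.Errorf("namespace %s still has %d manifestworks and %d addons, skipping deletion",
			clusterNamespace, len(works.Items), len(addons.Items))
	}
	return kubeClient.CoreV1().Namespaces().Delete(ctx, clusterNamespace, metav1.DeleteOptions{})
}
```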

@jaswalkiranavtar
Contributor Author

@qiujian16 @zhiweiyin318 Thanks for your comments. We will not make this change for now because of the complexity involved.
