Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: [Zero-Downtime] - safe migration #1890

Open
1 of 4 tasks
Tomasz-Smelcerz-SAP opened this issue Sep 25, 2024 · 7 comments
Open
1 of 4 tasks

feat: [Zero-Downtime] - safe migration #1890

Tomasz-Smelcerz-SAP opened this issue Sep 25, 2024 · 7 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@Tomasz-Smelcerz-SAP
Copy link
Member

Tomasz-Smelcerz-SAP commented Sep 25, 2024

Description

In order to introduce the zero-downtime procedure, we need a safe migration path.
By "safe migration path" I understand a setup, where a "revert" to the old behavior is as simple as possible - in case there is a bug in the new solution or it is not working as expected for whatever reason.
In particular, the revert should be as simple as switching some Lifecycle-Manager flags (runtime arguments) - the less, the better.

An Idea for how this could work:

The current solution uses root certificate secret as the Istio-Gateway secret directly.
Once the root cert is rotated, the LM code deletes the client secrets (that are based on the "previous" root), causing cert-manager to renew them.
The new ("zero-downtime") solution uses a dedicated secret for the Istio-Gateway, managed entirely by Lifecycle-Manager.
This new secret of course has a different name from the "root" secret, that is still managed and rotated by the cert-manager.
The dedicated secret decouples Istio-Gateway from changes to the root certificate - it is Lifecycle-Manager that decides when to propagate the changes.
How can we revert from the new solution to the old one?
If the secret name used for Istio-Gateway is different in the old (current) and new (future) solution, then in order to switch back we would have to change the Helm Charts, or at least the entry in the values.yaml. In addition we would probably need to deploy a different LM version - the one with the "old" logic. But then we're reverting the Lifecycle-Manager version, along with all the other features, security fixes etc.

To improve the situation, we should:

  • have a LM version that is capable of running either the old or the new code - depending on some feature flag.
  • the Istio Gateway secret name should be the same in both scenarios, to avoid manual tweaks in the Helm charts.

The first requirement is relatively easy - we just need to extract the relevant cert-management logic to a component and then provide two different implementations: the old one (current) and the new one. And we need a flag to decide which component should be actually used at runtime.

The second requirement is more tricky. In order to make it work, we should change the Lifecycle-Manager in the following way (it's just an idea, maybe it can be done in a simpler way):

  • The root certificate is no longer directly used to configure Istio Gateway
  • We introduce our own, managed secret that is a copy of the Istio-Gatewy secret.
  • There is a new "agent" that actively syncs the root secret to the copy.
  • Minimal changes in the code are required - The reconciliation logic no longer inspects the timestamps on the root secret, it inspects the created copy instead
  • Kustomize configuration of the Istio Gateway (and Helm charts) are changed so that the new secret is used instead of the "root" one.

By introducing this solution ("safe-migration"), we achieve the following:

  • Both the "safe-migration" and "final" solutions use the same secret name for the Istio-Gateway. Hence, when reverting, there's no need to change anything in the Helm charts.
  • The revert can be accomplished by switching a single boolean flag on the Lifecycle-Manager, like: --cert-mangament-legacy=true, assuming no additional configuration flags are required.
  • Don't worry about the "copy" or the new "agent" - in the final solution these also exists. And this issue introduces just a temporary solution. Once the final solution works, we'll remove the support for the "old" logic entirely.

Implementation notes for the syncing "agent" - for lack of a better name, let's call it: "Istio Gateway Secret Manager"

  • What we really need is something that watches the root secret and acts upon changes on this object
  • So it can be a single goroutine, a mini-controller, etc. Maybe even something else - a little research is required.
  • See that although the secret name is the same, the data in the secret differs. The "safe-migration" solution makes IstioGatewaySecret just a copy of the root secret, while the final solution uses a modified (newRoot+oldRoot) values for the caBundle in the IstioGatewaySecret.

See also: #1506

Reasons

Safe migration path - ability to revert fast and with minimal risk of doing it wrong - in case of troubles.

Implementation notes:

  • introduce a new mechanism for copying the secret. Can be a single goroutine with a watcher/informer. It should have a low-frequency polling in case k8s event is lost (AFAIK k8s events are not 100% reliable)
  • modify the condition upon which the skr watcher client secrets are removed:
    func (c *CertificateManager) RemoveSecretAfterCARotated(ctx context.Context, kymaObjKey client.ObjectKey) error {
  • if necessary, consider adding some annotation on the Istio Gateway Secret (IGS) object to make "rotation detection" in the code easy.

Acceptance Criteria

  • Double-check the proposed solution if it's not missing anything important
  • Verify with security team if the LM-managed secret is OK (we need this anyway for the "new solution")
  • introduce a secret manager component that does a copy
  • change the condition for watcher client secret removal so that it involves updates to the Istio Gataway Secret (which should be an up-to-date copy of the Root Secret)

Feature Testing

No response

Testing approach

No response

Attachments

No response

Related Issues:

#1430

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 25, 2024
@Tomasz-Smelcerz-SAP
Copy link
Member Author

Tomasz-Smelcerz-SAP commented Oct 10, 2024

@Tomasz-Smelcerz-SAP
Copy link
Member Author

Tomasz-Smelcerz-SAP commented Oct 10, 2024

Diagram: Kyma Reconciliation Logic after the changes proposed in this issue: https://www.mermaidchart.com/raw/f68fb76a-79db-4dac-a646-c99541407eed?theme=light&version=v0.1&format=svg

@LeelaChacha
Copy link
Contributor

An API-less controller for one secret does not make sense and the Informer mechanism is also not suitable for our purposes since we can only react on events on the secret (so it will be too late already, the CA certificate would have been already rotated)

AFAIK, a go routine started in the main.go is the most sensible approach. It can either sleep a fixed amount of time and check if the CA ceritifcate is close to rotation or it can look at the renewal period in the Certifcate resource and calculate a timedelta and sleep for that period (this is more efficient but it should be coded in a way that if KLM pod dies, the timedelta is always calculated anew to avoid missing the CA rotation)

@Tomasz-Smelcerz-SAP
Copy link
Member Author

Tomasz-Smelcerz-SAP commented Oct 17, 2024

@LeelaChacha I don't think I got your point. We don't don't need to know exactly when the Cert is rotated. We just need to know that it has been recently rotated. This is because we don't need the secret's "previous" value - but a new one, after the rotation. Then we copy this value to our "IstioGatewaySecret". And if we, for any reason, need the "previous" value, then we have it - in the existing IstioGatewaySecret - it was copied there on previous rotation or on the system bootstrap. So a single goroutine with a watch on the secret should be enough.

Now the question remains: "how quickly should we make the copy of the secret?".
The answer is: we can be quite "slow" or "lazy", if you prefer :)
I'd say that probably everything works fine even if we detect the change after few hours from the actual rotation. Why? Because nothing would really "break" - the Istio Gateway has a valid certificate and the clients have valid certificate. Remember that rotation doesn't invalidate the "old" cert. It is still valid until it's expiration time. And this expiration time is the only "hard" time boundary for our solution.
I assume rotation is configured to happen at least 24h before expiration so that gives us plenty of time.
But to provide some reasonable value for the first iteration: We should detect the rotation (and make a copy) in less than 5 minutes. This triggers the client secret migration procedure, and they have to be migrated before the "old" cert expires, otherwise they'll lose connectivity.

@Tomasz-Smelcerz-SAP
Copy link
Member Author

The current code with a condition for secret removal:

func (c *CertificateManager) RemoveSecretAfterCARotated(ctx context.Context, kymaObjKey client.ObjectKey) error {

@LeelaChacha
Copy link
Contributor

LeelaChacha commented Oct 23, 2024

@LeelaChacha We just need to know that it has been recently rotated. This is because we don't need the secret's "previous" value - but a new one, after the rotation. Then we copy this value to our "IstioGatewaySecret".

I get your point. But an informer is still not a good way to watch for the secret since an informer can execute a function by attaching an event handler to events of a kind only. That means all events on all secrets would trigger the event handler and waste resources until we can actually check the namespace and name of the secret to check if it is even relevant.

Furthermore, using Informers creates duplicated caches as it does not share the source.Kind caches that the controllers use. So we will double the memory usuage.

@LeelaChacha LeelaChacha self-assigned this Oct 23, 2024
@Tomasz-Smelcerz-SAP
Copy link
Member Author

Tomasz-Smelcerz-SAP commented Oct 25, 2024

@LeelaChacha I think there are ways to limit the number of objects you watch, in particular you can watch for events on just a single k8s resource. But I don't know how to do it using controller-runtime library ATM. Do you think we need some additional, short PoC for the "how to watch for the secret" problem?

Edit: I have found this - may be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

2 participants