Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only update cluster-ca-key-generation from CaReconciler #10749

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

katheris
Copy link
Contributor

Type of change

Select the type of your PR

  • Bugfix

Description

When rolling pods to trust new CA cert signed by new CA key only update the cluster-ca-key-generation
annotation. Do not update cluster-ca-cert-generation or clients-ca-cert-generation.

In PR #10711 I added logic to update the cluster-ca-cert-gen , clients-ca-cert-gen and cluster-ca-key-gen annotations before rolling the pods. However, the pods are being rolled to trust a new CA cert that is signed by a new CA key. They are still presenting certificates that were issued by the old CA cert.

This means that if the reconcile fails after rolling the Kafka pods as part of the CaReconciler rolling update, but before it runs the KafkaReconciler logic, when it comes back in it would incorrectly think that the new CA cert was used everywhere and remove the old one from the ca-cert Secret.

We should only update the cluster-ca-key-gen annotation before doing the rolling update

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

When rolling pods to trust new CA cert signed by new
CA key only update the cluster-ca-key-generation
annotation. Do not update cluster-ca-cert-generation
or clients-ca-cert-generation.

Signed-off-by: Katherine Stanley <[email protected]>
@katheris katheris added this to the 0.44.0 milestone Oct 22, 2024
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Regression tests should be run once the CI queues are empty.

Skip restarting pods that have an
up-to-date CA key annotation if no
other restart reasons present.

Signed-off-by: Katherine Stanley <[email protected]>
Comment on lines 587 to 588
LOGGER.debugCr(reconciliation, "Rolling Pod {} due to {}", pod.getMetadata().getName(), podRollReasons.getReasons());
return podRollReasons;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to wrap this into else branch to have return validation.

@@ -575,6 +575,15 @@ Future<Void> rollingUpdateForNewCaKey() {
false,
eventPublisher
).rollingRestart(pod -> {
if (podRollReasons.getReasons().size() == 1 && podRollReasons.contains(RestartReason.CLUSTER_CA_CERT_KEY_REPLACED)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? Why do we even need the rolling update here for Clients CA key replacement?

  • It is not used internally, so we do not need to establish trust
  • It should be enough to roll this in the regular rolling update where it would anyway roll to change the annotation, or?

@katheris
Copy link
Contributor Author

I'm investigating the failure, I've worked out what's causing it, just trying to determine how best to address it

* Add explicit else clauses.
* Update KubernetesRestartEvent tests to
use the force-replace annotation to trigger
CA key replacement.

Signed-off-by: Katherine Stanley <[email protected]>
Comment on lines 581 to 571
if (podRollReasons.getReasons().size() == 1 && podRollReasons.contains(RestartReason.CLUSTER_CA_CERT_KEY_REPLACED)) {
// Check if this pod has been rolled previously
int clusterCaKeyGeneration = clusterCa.keyGeneration();
int podClusterCaKeyGeneration = Annotations.intAnnotation(pod, Ca.ANNO_STRIMZI_IO_CLUSTER_CA_KEY_GENERATION, clusterCaKeyGeneration);
if (clusterCaKeyGeneration == podClusterCaKeyGeneration) {
LOGGER.debugCr(reconciliation, "Not rolling Pod {} since the CA cert key generation is correct.", pod.getMetadata().getName());
return RestartReasons.empty();
} else {
LOGGER.debugCr(reconciliation, "Rolling Pod {} due to {}", pod.getMetadata().getName(), podRollReasons.getReasons());
return podRollReasons;
}
} else {
LOGGER.debugCr(reconciliation, "Rolling Pod {} due to {}", pod.getMetadata().getName(), podRollReasons.getReasons());
return podRollReasons;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, does this need to be updated to handle both reasons = both annotations? Otherwise, why are we patching the clients CA generation in the PodSet?

Update the CaReconciler to only roll pods
when the cluster CA key is replaced, not when
the clients CA key is replaced

Signed-off-by: Katherine Stanley <[email protected]>
@katheris
Copy link
Contributor Author

@scholzj @ppatierno I've pushed some updates to only roll when the cluster CA key is replaced, not the clients CA key (as we can leave that to the KafkaRoller) and only use the annotations, not the keyReplaced() method. I haven't done any manual testing yet, so take a look if you want but if you'd prefer to wait til I've done some manual testing for the clients CA key I'll update here when I've done that.

/**
* Clients CA private key was replaced
*/
CLIENT_CA_CERT_KEY_REPLACED("Trust new clients CA certificate signed by new key"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this should be removed from docs as well (search for ClientCaCertKeyReplaced).


// cluster CA needs to be fully trusted
// it is coming from a cluster CA key replacement which didn't end successfully (i.e. CO stopped) and we need to continue from there
if (clusterCa.keyReplaced() || isClusterCaNeedFullTrust) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we are 100% sure the clusterCa.keyReplaced() is not needed :-o

@scholzj
Copy link
Member

scholzj commented Oct 24, 2024

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Oct 24, 2024

/azp run zookeeper-regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Oct 25, 2024

@katheris Looks like the tests related to CA renewal failed: SecurityST.testAutoReplaceClientsCaKeysTriggeredByAnno

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants