
talos_cluster_health: kubelet server certificate rotation is enabled, but CSR is not approved #206

Open · walnuss0815 opened this issue Oct 8, 2024 · 1 comment

@walnuss0815

We are using the provider to deploy a two-node bare-metal Kubernetes cluster.

We need kubelet server certificate rotation enabled for metrics-server. The Kubelet Serving Certificate Approver is deployed via Argo CD, and Argo CD is deployed via Terraform right after the Talos cluster has been bootstrapped. Deploying the Kubelet Serving Certificate Approver via .cluster.extraManifests is not an option for us.
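For reference, rotation is enabled through the kubelet's rotate-server-certificates flag. A minimal sketch of the machine config patch (the local name is made up; the patch is fed into talos_machine_configuration_apply via config_patches):

locals {
  # Enables kubelet server certificate rotation; passed to
  # talos_machine_configuration_apply via config_patches.
  rotate_serving_certs_patch = yamlencode({
    machine = {
      kubelet = {
        extraArgs = {
          "rotate-server-certificates" = "true"
        }
      }
    }
  })
}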

Without talos_cluster_health, the deployment of Argo CD fails because the Kubernetes API is not yet ready. So in our case the health check is only needed to ensure that the Kubernetes API is ready to serve requests.

With talos_cluster_health, the health check fails. On the first run it fails with "missing static pods on node"; on the second run it fails with "kubelet server certificate rotation is enabled, but CSR is not approved".

First run:
│ Warning: failed checks
│ 
│   with module.talos.data.talos_cluster_health.this,
│   on .terraform/modules/talos/main.tf line 118, in data "talos_cluster_health" "this":
│  118: data "talos_cluster_health" "this" {
│ 
│ waiting for etcd to be healthy: ...
│ waiting for etcd to be healthy: 1 error occurred:
│       * 192.168.x.y: service is not healthy: etcd
│ 
│ 
│ waiting for etcd to be healthy: OK
│ waiting for etcd members to be consistent across nodes: ...
│ waiting for etcd members to be consistent across nodes: OK
│ waiting for etcd members to be control plane nodes: ...
│ waiting for etcd members to be control plane nodes: OK
│ waiting for apid to be ready: ...
│ waiting for apid to be ready: OK
│ waiting for all nodes memory sizes: ...
│ waiting for all nodes memory sizes: OK
│ waiting for all nodes disk sizes: ...
│ waiting for all nodes disk sizes: OK
│ waiting for no diagnostics: ...
│ waiting for no diagnostics: OK
│ waiting for kubelet to be healthy: ...
│ waiting for kubelet to be healthy: 1 error occurred:
│       * 192.168.x.y service "kubelet" not in expected state "Running": current state [Preparing] Running pre state
│ 
│ 
│ waiting for kubelet to be healthy: 1 error occurred:
│       * 192.168.x.y: service is not healthy: kubelet
│ 
│ 
│ waiting for kubelet to be healthy: OK
│ waiting for all nodes to finish boot sequence: ...
│ waiting for all nodes to finish boot sequence: OK
│ waiting for all k8s nodes to report: ...
│ waiting for all k8s nodes to report: Get "https://192.168.x.y:6443/api/v1/nodes": dial tcp 192.168.x.y:6443: connect: connection refused
│ waiting for all k8s nodes to report: can't find expected node with IPs ["192.168.x.y"]
│ waiting for all k8s nodes to report: OK
│ waiting for all control plane static pods to be running: ...
│ waiting for all control plane static pods to be running: missing static pods on node 192.168.x.y: [kube-system/kube-apiserver kube-system/kube-controller-manager kube-system/kube-scheduler]

Second run:
│ Warning: failed checks
│ 
│   with module.talos.data.talos_cluster_health.this,
│   on .terraform/modules/talos/main.tf line 118, in data "talos_cluster_health" "this":
│  118: data "talos_cluster_health" "this" {
│ 
│ waiting for etcd to be healthy: ...
│ waiting for etcd to be healthy: OK
│ waiting for etcd members to be consistent across nodes: ...
│ waiting for etcd members to be consistent across nodes: OK
│ waiting for etcd members to be control plane nodes: ...
│ waiting for etcd members to be control plane nodes: OK
│ waiting for apid to be ready: ...
│ waiting for apid to be ready: OK
│ waiting for all nodes memory sizes: ...
│ waiting for all nodes memory sizes: OK
│ waiting for all nodes disk sizes: ...
│ waiting for all nodes disk sizes: OK
│ waiting for no diagnostics: ...
│ waiting for no diagnostics: active diagnostics: 192.168.x.y: kubelet server certificate rotation is enabled, but CSR is not approved

With the Kubelet Serving Certificate Approver deployed manually after the Kubernetes API becomes ready, the health check succeeds and Terraform starts deploying Argo CD.

main.tf
.
.
.

resource "talos_machine_bootstrap" "this" {
  depends_on = [talos_machine_configuration_apply.controlplane]

  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = [for k, v in var.node_data.controlplanes : v.ip_address][0]
}

data "talos_cluster_health" "this" {
  depends_on = [talos_machine_bootstrap.this]

  client_configuration   = talos_machine_secrets.this.client_configuration
  control_plane_nodes    = [for k, v in var.node_data.controlplanes : v.ip_address]
  endpoints              = [for k, v in var.node_data.controlplanes : v.ip_address]
  skip_kubernetes_checks = true
}

resource "talos_cluster_kubeconfig" "this" {
  depends_on = [data.talos_cluster_health.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = [for k, v in var.node_data.controlplanes : v.ip_address][0]
}

.
.
.

Our expectation is that the health check succeeds with kubelet server certificate rotation enabled even when the Kubelet Serving Certificate Approver has not been deployed yet. Something like a minimal Kubernetes readiness check would also be sufficient in our case.
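For illustration, such a minimal readiness gate could look roughly like this. This is only a sketch, not provider functionality: a terraform_data resource with a local-exec curl loop, with made-up retry counts, assuming /readyz is reachable without credentials (which depends on the apiserver's anonymous-auth setting):

resource "terraform_data" "k8s_api_ready" {
  depends_on = [talos_machine_bootstrap.this]

  provisioner "local-exec" {
    command = <<-EOT
      endpoint="${[for k, v in var.node_data.controlplanes : v.ip_address][0]}"
      # Poll the apiserver's readiness endpoint until it answers, or give up.
      for i in $(seq 1 60); do
        curl -fks --max-time 5 "https://$endpoint:6443/readyz" && exit 0
        sleep 5
      done
      echo "Kubernetes API never became ready" >&2
      exit 1
    EOT
  }
}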

@hegerdes

I had a similar problem and solved it this way:

I do not use cluster_health. I just use a Terraform time_sleep of 90s and then try to apply Cilium and Argo CD. I do not wait until the cluster is ready; if it is not ready after that time, something has gone wrong anyway, and it doesn't matter whether it fails at the CNI/Argo apply or at the health check. (See the sketch below.)
The CSR not being approved is not a Talos problem; Kubernetes decided to ditch the automatic serving certificate approver. I also used the alex1989hu approver, but then discovered that Talos has a cloud-controller-manager that also ships a CSR approver module. Since I also like to have all or most of my apps managed by Argo, I decided to alter the alex1989hu approver into a batch Job that runs once for the first 5 minutes. This is applied via cluster.extraManifests. After that I just use the Argo-managed Talos CCM, but you could also run the alex1989hu approver as an Argo-managed deployment after the initial Job has finished.
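A minimal sketch of the time_sleep approach (hashicorp/time provider; the resource names and the Cilium release shown here are just examples):

# Give the freshly bootstrapped cluster a fixed head start instead of
# polling health checks.
resource "time_sleep" "wait_for_cluster" {
  depends_on      = [talos_machine_bootstrap.this]
  create_duration = "90s"
}

# CNI (and Argo CD) then simply depend on the sleep.
resource "helm_release" "cilium" {
  depends_on = [time_sleep.wait_for_cluster]

  name       = "cilium"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  namespace  = "kube-system"
}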
Hope this helps
