[proposal] Support for installing Kubernetes apps using Omni #622

Open
smira opened this issue Sep 11, 2024 · 8 comments

@smira
Member

smira commented Sep 11, 2024

Rationale

Omni allows a cluster to be defined fully via cluster templates, which makes it possible to install machines, bring them into the cluster, and assert that they are ready and healthy.
Cluster templates also allow configuring Talos Linux (and, transitively, Kubernetes).

Sometimes there are additional requirements to get the cluster up and running, e.g.:

  • install Cilium CNI (supported way is via Helm)
  • bootstrap the initial Kubernetes app installer/updater (e.g. ArgoCD or Flux), which is also via Helm
  • install some additional applications if the installation is simple, and gitops flow via ArgoCD/Flux is not desired

Today the only way to install Kubernetes apps is the Talos machine config "extra bootstrap manifests" feature. This feature is not based on Helm, so the installed manifests are not tracked as installed by Helm and can't be easily managed later with Helm. It also adds unnecessary bloat to the Talos machine configuration.

Omni is in a perfect position to manage Kubernetes apps in the cluster: it is a single instance (unlike Talos control plane machines, of which a cluster can have several), it already has information about cluster health (so it knows when it is safe to install), and it already has a language to describe the cluster (cluster templates).

Proposed Solution

As much as we are not happy with Helm, Helm is the de-facto standard.

For the initial phase, to keep things simple, let's limit ourselves to the initial installation of Helm charts (skipping upgrades, changing chart values, etc.), as this is simpler, less risky, and solves the immediate problem of fully bootstrapping the cluster. In future work, we might support updating charts as well.

As cluster templates are plain-text YAML files, we should try to preserve this simple approach, which is friendly to version control, expansion, templating, etc. The proposal is to use Helmfile as the language to describe what has to be installed.

We can add a field strategy and require it to be set to bootstrap-only, to indicate that for now the charts are installed only once.

The initial scope is to support only charts available to Omni without auth or special setup; that is, Omni should be able to download the charts from public repositories.

Cluster templates should sync the Helm instructions to an Omni resource (per cluster) describing charts to be installed.
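
To make the shape of such a per-cluster resource concrete, here is a minimal sketch in Python (purely illustrative, not how Omni implements resources). The names HelmInstallSpec, cluster_id, and helmfile are hypothetical; only the strategy field and its bootstrap-only value come from the proposal.

```python
from dataclasses import dataclass

# The only strategy the proposal allows for the initial phase.
ALLOWED_STRATEGIES = frozenset({"bootstrap-only"})


@dataclass(frozen=True)
class HelmInstallSpec:
    """Hypothetical per-cluster resource synced from a cluster template."""

    cluster_id: str
    helmfile: str  # raw Helmfile YAML, kept as text
    strategy: str  # required: there is deliberately no default

    def __post_init__(self) -> None:
        # Reject anything other than the single supported strategy.
        if self.strategy not in ALLOWED_STRATEGIES:
            raise ValueError(
                f"unsupported strategy {self.strategy!r}; "
                f"expected one of {sorted(ALLOWED_STRATEGIES)}"
            )
```

Making strategy a required field with one accepted value would leave room to add other strategies later without changing the meaning of existing templates.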

Omni should have a controller which watches the cluster status and, as soon as the cluster is ready (the Kubernetes API is available), performs the Helm installation. Omni keeps the status of the install; if the install is done and the strategy is bootstrap-only, it skips any further work on this cluster/Helm chart.
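
The controller behaviour described above could look roughly like the following Python sketch. It is an assumption-laden illustration: InstallStatus, reconcile, and the install callback are hypothetical names, and the callback stands in for whatever Helm integration Omni would actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class InstallStatus:
    """Hypothetical status record Omni would persist per cluster."""

    installed: bool = False


def reconcile(
    api_available: bool,
    strategy: str,
    status: InstallStatus,
    install: Callable[[], None],
) -> str:
    """Run one reconcile pass and report what happened."""
    if not api_available:
        return "waiting"  # Kubernetes API is not reachable yet
    if status.installed and strategy == "bootstrap-only":
        return "skipped"  # already bootstrapped, so do no further work
    install()  # stand-in for the real Helm installation
    status.installed = True  # recorded only after install() returns
    return "installed"
```

Because the decision depends on the persisted status rather than on cluster events, a later reconcile pass (for example after an upgrade) would return "skipped" instead of installing again.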

Omni might keep a cache of downloaded Helm charts.
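
As a rough illustration of such a cache, here is an in-memory Python sketch. ChartCache and the fetch callback are hypothetical; the sketch assumes that a chart is fully identified by repository URL, chart name, and version, and that a published version never changes.

```python
from typing import Callable, Dict, Tuple

# (repository URL, chart name, chart version)
ChartKey = Tuple[str, str, str]


class ChartCache:
    """Hypothetical cache that downloads each chart version at most once."""

    def __init__(self, fetch: Callable[[str, str, str], bytes]) -> None:
        self._fetch = fetch  # stand-in for the real chart download
        self._charts: Dict[ChartKey, bytes] = {}

    def get(self, repo: str, name: str, version: str) -> bytes:
        key = (repo, name, version)
        if key not in self._charts:
            # Cache miss: download once and remember the archive.
            self._charts[key] = self._fetch(repo, name, version)
        return self._charts[key]
```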

Future Work

  • support updating/upgrading charts
  • support private Helm charts
  • support sharing Helmfile parts across clusters (i.e. enforcing a policy that e.g. cert-manager vX.Y should be installed for all clusters)
  • showing pending updates/scheduling updates, etc.
@utkuozdemir
Member

I like it. My only concern would be Helmfile - I used it for a while on my homelab some years ago, but hit some issues and stopped using it. But I hope it's way better now, as it is actively developed.

My initial idea was to use Flux CD for it, but maybe we leave it to the cluster operators as it is way more complex and CRD based - Helmfile seems to give us the declarative language we need, without entering into the CRDs territory.

Another item in the future work could be, although it is loosely related, some sort of secrets management for these workloads.

@smira
Member Author

smira commented Sep 11, 2024

> Another item in the future work could be, although it is loosely related, some sort of secrets management for these workloads.

Yes, there's an issue: #572.

I think we should support sops, includes, and templating in cluster templates (but that deserves a separate issue).

@smira
Member Author

smira commented Sep 11, 2024

Potential problems:

  • helmfile might shell out to helm and other tools (we would need to implement our own helm integration)

> To avoid upgrades for each iteration of helm, the helmfile executable delegates to helm - as a result, helm must be installed.

  • support only Helm charts in Helmfile
  • ask the helmfile devs about a PR to optionally use helm as a library

@rsmitty
Member

rsmitty commented Sep 11, 2024

One random thought I had earlier about the longer-term implementation here: we should take care to design how we'll sync all clusters if we decide to support ongoing rollouts. With, say, 1000+ clusters using this feature and a sync every 15m or so, we should have some random splay, batching, or other mechanism so that Omni isn't trying to update all 1000+ at once.

Totally not for the initial work here, but just wanted to capture it somewhere.
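
One way to get the splay mentioned above is to derive a stable per-cluster offset from the cluster ID, so each cluster syncs at its own point inside the interval. The Python sketch below is only an illustration: sync_offset_seconds is a hypothetical helper, and the 900-second default is the 15m figure from the comment.

```python
import hashlib


def sync_offset_seconds(cluster_id: str, interval_seconds: int = 900) -> int:
    """Map a cluster ID to a stable offset inside the sync interval."""
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and fold it into the interval.
    return int.from_bytes(digest[:8], "big") % interval_seconds
```

A hash-based offset is deterministic, so a cluster keeps the same slot across Omni restarts, whereas a purely random delay would reshuffle on every start.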

@smira
Member Author

smira commented Sep 12, 2024

> Totally not for the initial work here, but just wanted to capture it somewhere.

Good point. This should mostly work by design: the controller has a fixed set of worker slots, so the concurrency of the operation is controlled by the number of slots in the controller applying Helmfiles.
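
The effect of a fixed number of worker slots can be illustrated with a bounded thread pool. This Python sketch is an analogy only, not Omni's controller runtime; apply_all, the apply callback, and the default of 4 slots are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def apply_all(
    cluster_ids: Iterable[str],
    apply: Callable[[str], None],
    slots: int = 4,
) -> List[str]:
    """Apply Helmfiles for many clusters, at most `slots` at a time."""
    ids = list(cluster_ids)
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # list() drains the iterator so worker exceptions surface here.
        list(pool.map(apply, ids))
    return ids
```

With 1000+ clusters queued, only `slots` of them are being processed at any moment; the rest wait for a free slot.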

@smira
Member Author

smira commented Sep 13, 2024

I can't say that I like it, but another idea might be to run something like helmfile-controller inside the workload cluster configured by a ConfigMap for example, and Omni simply pushes the ConfigMap, and waits for the controller to do its job.

This might simplify some requirements (e.g. having different versions of the controller, or having access to private helm charts), but it takes away some resources from the workload cluster, the controller has to run with host networking (to install CNI), etc.

@utkuozdemir
Member

> I can't say that I like it, but another idea might be to run something like helmfile-controller inside the workload cluster configured by a ConfigMap for example, and Omni simply pushes the ConfigMap, and waits for the controller to do its job.
>
> This might simplify some requirements (e.g. having different versions of the controller, or having access to private helm charts), but it takes away some resources from the workload cluster, the controller has to run with host networking (to install CNI), etc.

If we decided to go that route, we could use flux instead. I'd rather leave those things to the cluster operator, and do the helmfile part completely from Omni, so the clusters would stay "vanilla".

@kenlasko

kenlasko commented Oct 21, 2024

Another note regarding the strategy field with bootstrap-only: we should make sure that bootstrap-only triggers only on the initial deployment. It is my understanding that the current inlineManifests option is triggered on upgrades as well, even though the docs imply it's only on bootstrap. This could be problematic.
