
fix(cilium): add taint and install cilium first - WIP #275

Merged
merged 4 commits into staging from master on Aug 27, 2022

Conversation

t1murl
Contributor

@t1murl t1murl commented Aug 13, 2022

This is a WIP PR to fix #270 .

I've just had a few minutes to look at this so far so please consider this as me sharing my idea.

When I've had more time to test and finalize, I'll mark this ready for review.
However, please do not hesitate to edit if someone else finds time earlier than me 👍 .

I've added ToDo comments where I left off.

Edit: Done force pushing, sorry for that.

@t1murl t1murl force-pushed the master branch 4 times, most recently from e967687 to c053b3e on August 14, 2022 03:08
@mysticaltech
Collaborator

Very good start @t1murl, thank you :) I would just love you to keep the HelmChart and values logic, as it's ideal IMHO. If not, why?

@t1murl
Contributor Author

t1murl commented Aug 14, 2022

@mysticaltech , thanks! 👍
The k3s HelmChart controller spawns a pod to perform the helm chart installation. If we taint the node with the Cilium NoExecute taint, that pod does not tolerate the taint and stays stuck in Pending (so far I have not found a way to add tolerations to the pod spawned by the controller), so the chart never gets installed. Please note that with my current draft you do not lose the customization through the values option.
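For illustration, the plain-helm path in this draft looks roughly like the following sketch (repository URL, release name, namespace and values file path are generic assumptions, not the exact code in this PR):

```bash
# Sketch: install Cilium with the helm CLI instead of the HelmChart controller,
# while keeping customization via a values file.
helm repo add cilium https://helm.cilium.io/
helm repo update
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --values ./cilium-values.yaml
```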

@t1murl
Contributor Author

t1murl commented Aug 14, 2022

Also just noticed the k3s docs stating:

K3s agents can be configured with the options --node-label and --node-taint which adds a label and taint to the kubelet. The two options only add labels and/or taints at registration time, so they can only be added once and not changed after that again by running K3s commands.

It seems like the taint is not re-applied after a node reboot, potentially leaving the initial problem description unresolved by what I've done so far.
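For context, the registration-time taint recommended by the Cilium docs is wired into k3s roughly like this (a sketch only; placeholders in angle brackets, and the exact wiring in this module may differ):

```bash
# Sketch: the taint is only added when the node registers with the cluster.
# A plain OS reboot of an already-registered node does not re-apply it.
k3s agent \
  --server "https://<first-control-plane-ip>:6443" \
  --token "<node-token>" \
  --node-taint "node.cilium.io/agent-not-ready=true:NoExecute"
```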

I'll look into what we may be able to do with e.g. kured (i.e. have it taint the node before it issues the reboot) to extend the current draft.

@mysticaltech
Collaborator

Got it! We can execute arbitrary commands with Kured. So maybe a custom bash script would help resolve the situation?!
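For example, kured's --reboot-command flag could point at a small wrapper that taints first and then reboots (the flag is a real kured flag; the wrapper path and its contents are hypothetical):

```bash
# Sketch: tell kured to run a wrapper instead of the default reboot command.
kured --reboot-command="/usr/local/bin/taint-then-reboot.sh"

# Hypothetical /usr/local/bin/taint-then-reboot.sh:
#   kubectl taint node "$(hostname)" node.cilium.io/agent-not-ready=true:NoExecute --overwrite
#   systemctl reboot
```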

@mysticaltech
Collaborator

Or a startup script for the Cilium nodes; MicroOS offers that kind of functionality easily.

@t1murl
Contributor Author

t1murl commented Aug 14, 2022

@mysticaltech , that also sounds promising! I am not really familiar with MicroOS... I reckon you mean this? Any help on this is more than welcome :)

@mysticaltech
Collaborator

@t1murl Combustion or Ignition are for first boot. Am thinking more in terms of https://www.suse.com/c/easy-running-scripts-boot-and-shutdown/

@t1murl
Contributor Author

t1murl commented Aug 14, 2022

FYI: Tainting the node while cilium is still running will result in cilium instantly deleting the taint again. I feel like we might want to taint the node when it is in a state where cilium is not running (anymore) and before pods start coming up (again).

@t1murl
Contributor Author

t1murl commented Aug 14, 2022

@t1murl Combustion or Ignition are for first boot. Am thinking more in terms of https://www.suse.com/c/easy-running-scripts-boot-and-shutdown/

@mysticaltech From quickly reading into this, this indeed sounds great to me for correctly timing the tainting! Very good idea in my opinion 🙂👍

I feel like we should try this. We only need to find a neat way to run e.g. kubectl taint node $(hostname) "node.cilium.io/agent-not-ready:NoExecute" (or the respective API call). Currently there does not seem to be a kubeconfig ready to use on the worker nodes.
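Something along these lines, once a kubeconfig has been made available on the node (the path below is an assumption):

```bash
# Sketch: taint the local node, pointing kubectl at a kubeconfig pushed to the agent.
kubectl --kubeconfig /etc/rancher/k3s/kubeconfig.yaml \
  taint node "$(hostname)" "node.cilium.io/agent-not-ready=true:NoExecute" --overwrite
```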

@mysticaltech
Collaborator

Perfect!! Yep, sounds good 😊

About the kubeconfig: it is already being downloaded from the first control plane node (in init.tf, if I remember correctly), so we could re-upload it to the worker nodes, for instance!

@mysticaltech
Collaborator

@t1murl I just came back from a break; I'll try to help with this ASAP.

@mysticaltech
Collaborator

@PurpleBooth If you have some bandwidth, you know how much your contributions are appreciated! :) 🙏

@PurpleBooth
Contributor

PurpleBooth commented Aug 24, 2022

Let me reflect back my understanding of the problem here.

What we're trying to avoid is containers that are not managed by Cilium, i.e. ones that were launched before Cilium was fully initialised? From what I understand, Cilium uses two techniques for this: firstly, you can use this taint, or secondly, you can restart the pods. We probably don't want to be randomly restarting pods, so the taint is the nicer option.

However, adding this taint means that the HelmController's pod won't be scheduled, because the tolerations for bootstrap: true do not include the special taint. So we need a way to get the helm chart to run.

The approach here is to use the helm CLI rather than the controller. Other approaches might include adding the toleration to the pod spawned by the controller, using the cilium CLI (which is basically the same as the helm chart), or rendering the chart with helm template somewhere in this module.
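As a rough sketch of that last alternative (repo URL and values path are assumptions):

```bash
# Sketch: render the Cilium chart to plain manifests and apply them directly,
# bypassing the helm-controller job entirely.
helm repo add cilium https://helm.cilium.io/
helm template cilium cilium/cilium \
  --namespace kube-system \
  --values ./cilium-values.yaml | kubectl apply -f -
```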

I dropped a question over at the HelmChart repository to see if they have any thoughts (k3s-io/helm-controller#151). I find it hard to believe we are the only ones to have experienced this.

@mysticaltech
Collaborator

@PurpleBooth I did not realize that the Helm controller was in play even after Cilium is installed. Good catch!

@t1murl Does that make sense to you?

@t1murl
Contributor Author

t1murl commented Aug 25, 2022

@PurpleBooth I did not realize that the Helm controller was in play even after Cilium is installed. Good catch!

@t1murl Does that make sense to you?

@PurpleBooth , thanks for sharing your reflection on this problem and thanks for opening k3s-io/helm-controller#151! 👍
I feel this will help us understand if/how we can use the HelmController to install the Cilium chart when tainting the nodes, instead of my proposal of using plain helm + kubectl apply (or one of the other approaches you mentioned that avoid the HelmController).
Please note that with the proposal in this PR, the initial deployment works as expected (simply by installing Cilium before anything else). While the Cilium pods are starting, the HelmController pods (which do not tolerate the taint) that install everything else patiently wait in Pending until Cilium removes the taint once it is ready, so no issues here.

We still have to implement something that accounts for node reboots after the initial deployment, i.e. the problem description I provided in #270, where I described how pods become unmanaged by Cilium after a node reboot by e.g. kured, because they are already scheduled and start faster than the Cilium pods when the node comes back up.

To account for this, my understanding of a potentially feasible solution would be to use @mysticaltech's proposal mentioned in #275 (comment) to taint the node again (required due to #275 (comment)) when a reboot occurs (also see #275 (comment) regarding the timing, and #275 (comment)). Of course, this only accounts for 'safe/regular' node reboots and not power loss or unexpected shutdowns that bypass @mysticaltech's proposed shutdown-script solution, which I would find acceptable.

cc: @mysticaltech

PS: Unfortunately even my weekends are currently busy (birthdays & weddings to attend), so a few more days will pass until I can personally try out @mysticaltech's proposal. Happy to hear you are back, @mysticaltech, to assist with this! 👍

@mysticaltech
Collaborator

@t1murl Thanks for clarifying! It helps a lot. As soon as any of us starts working on this, let's signal it here so that we do not overlap on the same things. I'll do my best ASAP, if @PurpleBooth doesn't get this sorted before me (she tends to be fast when she starts!). And I'm looking forward to a reply to the issue she opened.

@PurpleBooth
Contributor

For restarts we could use Kured: https://github.com/weaveworks/kured#adding-node-labels-before-and-after-reboots
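For reference, the flags from that README section look roughly like this (flag names as in the kured docs; the label key and values here are placeholders, and note the follow-up comments below: these set labels, not taints):

```bash
# Sketch: kured can add/remove node labels around a reboot.
kured \
  --pre-reboot-node-labels=example.com/kured-reboot=true \
  --post-reboot-node-labels=example.com/kured-reboot=false
```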

@mysticaltech
Collaborator

Excellent find, @PurpleBooth, that's another obstacle out of the way!!

@mysticaltech
Collaborator

mysticaltech commented Aug 26, 2022

I believe it's now worth tweaking this PR and giving it a try as is. Maybe the Cilium containers will start after all. And if they successfully remove the label, that's it, unless I'm not seeing things as clearly as you. It's also important to remember that not all nodes restart simultaneously, and in an HA setting, at least two CP nodes would already have been bootstrapped and have Cilium running on them. @PurpleBooth What do you think?

@t1murl
Contributor Author

t1murl commented Aug 26, 2022

@PurpleBooth , @mysticaltech to avoid confusion: node labels != node taints :O AFAIK node labels are not the same thing as node taints.
Having kured apply a node label will not prevent pods from starting before the Cilium pods are ready.
Cilium's answer for this situation is a node taint, as mentioned before (https://docs.cilium.io/en/v1.12/gettingstarted/taints/#taint-effects). A taint is what causes pods that do not tolerate it to not be scheduled, or in our case not executed, when tainting with :NoExecute as per the Cilium recommendation.

I've already considered what we could do with kured; however, I've not found a way to have it add a taint out of the box in a way that would help us in the context of this issue. I then thought about having it execute a local script to taint the node before it issues the reboot command (as also mentioned in #275 (comment)). I found this not feasible, however, due to #275 (comment).

I hope this helps :)

@mysticaltech
Collaborator

@t1murl My bad, I was confusing the two!

@mysticaltech
Collaborator

mysticaltech commented Aug 27, 2022

@t1murl @PurpleBooth If the Cilium CLI can solve our issues, then let's move away from the helm controller if needed. We can install it on the CP nodes at the same time the k3s SELinux RPM is installed, so that no additional reboots are needed.
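For reference, the cilium CLI route would be roughly this (a minimal sketch; version pinning and any custom values are left out):

```bash
# Sketch: install Cilium with the cilium CLI from a control-plane node,
# then wait until the agents report ready.
cilium install
cilium status --wait
```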

@mysticaltech
Collaborator

mysticaltech commented Aug 27, 2022

And I have verified how to run startup scripts; it's basically via systemd, as explained here https://www.golinuxcloud.com/run-script-at-startup-boot-without-cron-linux/.
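A minimal sketch of that systemd route, assuming kubectl and a kubeconfig are available on the node (the unit name, binary path and kubeconfig location are all assumptions):

```bash
# Sketch: provision a oneshot unit that re-applies the Cilium taint early at boot,
# before workloads come back up. %H expands to the node's hostname.
cat > /etc/systemd/system/cilium-node-taint.service <<'EOF'
[Unit]
Description=Re-apply the Cilium agent-not-ready taint at boot
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
Environment=KUBECONFIG=/etc/rancher/k3s/kubeconfig.yaml
ExecStart=/usr/local/bin/kubectl taint node %H node.cilium.io/agent-not-ready=true:NoExecute --overwrite

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable cilium-node-taint.service
```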

As said above, the kubeconfig.yaml is already being downloaded to the root of the user's module, so we have it in Terraform as a variable; it just needs to be pushed to agent nodes that don't have it.

I also believe it's worth ignoring the tolerations of the Helm controller for now. Let's see whether Cilium still runs once its taint is applied. If it does not, the wisest solution would be to move away from the Helm controller towards the Cilium CLI to install and run Cilium.

@mysticaltech
Collaborator

mysticaltech commented Aug 27, 2022

Good news folks, there is a good chance that custom tolerations can be specified via the helm chart, in our case via the HelmChart definition. See this example for Longhorn (longhorn/longhorn#2664); other Rancher projects do the same.
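As a rough illustration of that mechanism (the HelmChart fields are the k3s helm-controller ones, but the chart-specific values key for tolerations below is a placeholder in the spirit of the linked Longhorn example):

```bash
# Sketch: pass tolerations through the chart's own values via a HelmChart manifest.
kubectl apply -f - <<'EOF'
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: longhorn
  namespace: kube-system
spec:
  repo: https://charts.longhorn.io
  chart: longhorn
  targetNamespace: longhorn-system
  valuesContent: |-
    # Chart-specific values key (placeholder); the exact key depends on the chart.
    tolerations:
      - key: node.cilium.io/agent-not-ready
        operator: Exists
        effect: NoExecute
EOF
```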

@mysticaltech
Collaborator

mysticaltech commented Aug 27, 2022

Last but not least, various tolerations that Cilium supports are also supported by the helm controller jobs: https://github.com/k3s-io/helm-controller/search?q=toleration&type=commits

That said, from what I gather, custom tolerations in the HelmChart YAML would go under "spec".

@mysticaltech
Collaborator

@t1murl I will close this, as I will redo most of it to keep the current HelmChart method. But it's good inspiration, thank you!

@mysticaltech
Collaborator

mysticaltech commented Aug 27, 2022

Changed my mind, I'm merging instead so I have you in the git history, @t1murl; it's very important to me as you've helped so much 🙏. I will compare against master and make the required changes.

@mysticaltech mysticaltech changed the base branch from master to staging August 27, 2022 16:44
@mysticaltech mysticaltech marked this pull request as ready for review August 27, 2022 16:45
@mysticaltech mysticaltech merged commit 57016ac into kube-hetzner:staging Aug 27, 2022
Successfully merging this pull request may close these issues.

Cilium failing after automatic reboot