From 4fffd3b5b92da6e3da50665fa9a817e09a63ad46 Mon Sep 17 00:00:00 2001
From: Hugo Blom <6117705+huxcrux@users.noreply.github.com>
Date: Wed, 12 Jun 2024 16:04:12 +0200
Subject: [PATCH 1/2] Add autohealing information for kubernetes

---
 .../en/docs/kubernetes/guides/autohealing.md | 39 +++++++++++++++++++
 1 file changed, 39 insertions(+)
 create mode 100644 content/en/docs/kubernetes/guides/autohealing.md

diff --git a/content/en/docs/kubernetes/guides/autohealing.md b/content/en/docs/kubernetes/guides/autohealing.md
new file mode 100644
index 0000000..64f4389
--- /dev/null
+++ b/content/en/docs/kubernetes/guides/autohealing.md
@@ -0,0 +1,39 @@
+---
+title: "Auto Heling"
+description: "Automatic Healing for Unresponsive or Failed Kubernetes Nodes"
+weight: 5
+alwaysopen: true
+---
+
+In our Kubernetes Services, we have implemented a robust auto-healing mechanism to ensure the high availability and reliability of our infrastructure. This system is designed to automatically manage and replace unhealthy nodes, thereby minimizing downtime and maintaining the stability of our services.
+
+## Auto-Healing Mechanism
+
+### Triggers
+
+1. **Unready Node Detection**:
+    - The auto-healing process is triggered when a node remains in an "not ready" or "unknown" state for 15 minutes.
+    - This delay allows for transient issues to resolve themselves without unnecessary node replacements.
+
+2. **Node Creation Failure**:
+    - To ensure new nodes are given adequate time to initialize and join the cluster, we have configured startup timers:
+        - **Control Plane Nodes**:
+            - A new control plane node has a maximum startup time of 30 minutes. This extended period accounts for the critical nature and complexity of control plane operations.
+        - **Worker Nodes**:
+            - A new worker node has a maximum startup time of 10 minutes, reflecting the relatively simpler setup process compared to control plane nodes.
+
+### Actions
+
+1. **Unresponsive Node**:
+    - Once a node is identified as unready for the specified duration, the auto-healing system deletes the old node.
+    - Simultaneously, it initiates the creation of a new node to take its place, ensuring the cluster remains properly sized and functional.
+
+## Built-in Failsafe
+
+To prevent cascading failures and to handle scenarios where multiple nodes become unresponsive, we have a built-in failsafe mechanism:
+
+- **Threshold for Unresponsive Nodes**:
+    - If more than 35% of the nodes in the cluster become unresponsive simultaneously, the failsafe activates.
+    - This failsafe blocks any further changes, as such a widespread issue likely indicates a broader underlying problem, such as network or platform-related issues, rather than isolated node failures.
+
+By integrating these features, our Kubernetes Services can automatically handle node failures and maintain high availability, while also providing safeguards against systemic issues. This auto-healing capability ensures that our infrastructure remains resilient, responsive, and capable of supporting continuous service delivery.

From 87bd3ba449ebb2ecdd27f9283cfcf2f458ba6042 Mon Sep 17 00:00:00 2001
From: Hugo Blom <6117705+huxcrux@users.noreply.github.com>
Date: Wed, 12 Jun 2024 16:21:18 +0200
Subject: [PATCH 2/2] Update autohealing.md

---
 content/en/docs/kubernetes/guides/autohealing.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/en/docs/kubernetes/guides/autohealing.md b/content/en/docs/kubernetes/guides/autohealing.md
index 64f4389..c040b8a 100644
--- a/content/en/docs/kubernetes/guides/autohealing.md
+++ b/content/en/docs/kubernetes/guides/autohealing.md
@@ -1,5 +1,5 @@
 ---
-title: "Auto Heling"
+title: "Auto Healing"
 description: "Automatic Healing for Unresponsive or Failed Kubernetes Nodes"
 weight: 5
 alwaysopen: true
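
Note on the series: the decision rules described in the added guide (15-minute unready window, 30/10-minute startup timers, 35% failsafe) can be sketched as below. This is a minimal illustration under stated assumptions, not the platform's implementation. The names `Node`, `is_unhealthy`, and `nodes_to_replace` are hypothetical, and the guide does not say whether the limits are inclusive or exactly how "unresponsive" is counted for the failsafe, so those choices are assumptions too.

```python
from dataclasses import dataclass
from datetime import timedelta

# Thresholds taken from the guide text.
UNREADY_TIMEOUT = timedelta(minutes=15)
STARTUP_TIMEOUT = {
    "control-plane": timedelta(minutes=30),
    "worker": timedelta(minutes=10),
}
MAX_UNHEALTHY_FRACTION = 0.35


@dataclass
class Node:
    """Hypothetical view of a node, for illustration only."""
    name: str
    role: str                               # "control-plane" or "worker"
    joined: bool                            # has the node joined the cluster?
    age: timedelta                          # time since creation was requested
    unready_for: timedelta = timedelta(0)   # time spent "not ready"/"unknown"


def is_unhealthy(node: Node) -> bool:
    """Apply the two triggers: startup timer, then unready window."""
    if not node.joined:
        # Assumption: the startup limit is exceeded only strictly after it.
        return node.age > STARTUP_TIMEOUT[node.role]
    # Assumption: reaching the 15-minute mark is enough to trigger.
    return node.unready_for >= UNREADY_TIMEOUT


def nodes_to_replace(nodes: list[Node]) -> list[Node]:
    """Return nodes to delete and recreate, or nothing if the failsafe trips."""
    unhealthy = [n for n in nodes if is_unhealthy(n)]
    # Assumption: the failsafe counts the same nodes the triggers flag.
    if nodes and len(unhealthy) / len(nodes) > MAX_UNHEALTHY_FRACTION:
        return []  # failsafe: block all changes
    return unhealthy
```

With ten joined workers, one of which has been unready for 20 minutes, `nodes_to_replace` returns just that node. With four of the ten unready past the window (40%), it returns an empty list, because the failsafe blocks further changes.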