Hey everyone! We're experimenting with ways of making k8s node disruptions (node upgrades, termination, etc.) less disruptive to our Strimzi-managed Kafka clusters. We expected Drain Cleaner to help reduce periods of high E2E latency (message CreateTime to the message being consumed and processed) by giving Kafka enough time to move partition leadership gracefully. Replication doesn't matter as much here as we use

Before using Drain Cleaner, we see short latency spikes when brokers go down and come back up. With Drain Cleaner, things get a lot worse: we even end up with read-only partitions because two brokers go down at the same time after exceeding the eviction timeout.

We'd like to know whether we've configured the Cluster Operator correctly to use Drain Cleaner as intended. Happy to provide any relevant CO or Kafka CR configuration if needed, but at first glance we can't see anything that would be relevant here. Any assistance or debugging tips would be appreciated! Here is our setup:
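On the configuration side, the only Drain Cleaner-related piece we're aware of is the Strimzi-managed PodDisruptionBudget: as far as we understand, Drain Cleaner expects voluntary evictions to be blocked entirely, which means `maxUnavailable: 0` in the Kafka CR. A minimal sketch of that part (the cluster name `my-cluster` is just a placeholder, not our actual manifest, and the other required spec fields are omitted):

```yaml
# Sketch of the PodDisruptionBudget part of a Kafka CR for use with Drain Cleaner.
# maxUnavailable: 0 makes Kubernetes block voluntary evictions, so Drain Cleaner can
# intercept them and let the Cluster Operator roll the pods instead.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster  # placeholder name
spec:
  kafka:
    # ...listeners, storage, etc. omitted
    template:
      podDisruptionBudget:
        maxUnavailable: 0
  zookeeper:
    # ...replicas, storage, etc. omitted
    template:
      podDisruptionBudget:
        maxUnavailable: 0
```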
Timeline of events (not attaching logs for brevity, and omitting regular reconciliation events; the terminating timestamps are approximations from our readiness metrics):
Here are our observations:
I guess it might be best to start addressing these one by one ...
I think this is definitely an issue, especially when you have both a ZooKeeper pod and a Kafka pod on the same node: when that node is drained, both are evicted in parallel, while the nodes themselves are drained one by one. The rolling update triggered by Drain Cleaner normally takes 0-120 seconds to start. It then rolls the ZooKeeper pod first and only then the Kafka pod. So finishing this within 5 minutes cannot be guaranteed, especially if a pod needs to wait for some data to resync before being rolled, needs to pull a new image, etc. I think you need to solve this first to separate the impact of the force-killed pods from the regular draining.
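Just to make the mechanism clear: Drain Cleaner does not move anything itself. It intercepts the eviction request and annotates the pod, and the Cluster Operator then rolls that pod during one of its next reconciliations, which is where the 0-120 seconds come from. Roughly, the annotation looks like this:

```yaml
# Annotation Drain Cleaner puts on the Kafka / ZooKeeper pod whose eviction it intercepted.
# The Cluster Operator notices it on its next reconciliation and restarts the pod gracefully.
metadata:
  annotations:
    strimzi.io/manual-rolling-update: "true"
```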
Maybe just as a sidenote -> there is no real reason to colocate the ZooKeeper pods and Kafka pods. It is not the case that a Kafka broker talks only to the ZooKeeper node colocated with it, so colocation is basically as good as no colocation. Not colocating them might smooth things out with regard to the timeout. But at the same time, ZooKeeper nodes are normally quite small and might not warrant their own Kubernetes worker nodes. So I'm not sure what the best architecture is in your case -> I just wanted to note that the colocation does not bring much from Kafka's perspective.
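If you do end up separating them, one way to do it (just a sketch, assuming a cluster named `my-cluster` and the default Strimzi pod labels) is pod anti-affinity in the ZooKeeper pod template of the Kafka CR, so that ZooKeeper pods avoid nodes already running a broker of the same cluster:

```yaml
# Sketch: schedule ZooKeeper pods away from the nodes running my-cluster's Kafka brokers.
# Assumes the cluster name is my-cluster; Strimzi labels broker pods with
# strimzi.io/name: my-cluster-kafka.
spec:
  zookeeper:
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    strimzi.io/name: my-cluster-kafka
                topologyKey: kubernetes.io/hostname
```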