-
I think a complete rewrite might be tricky. It is connected to many different parts of the operator, so there may be many conflicts. You also cannot roll it out with only some of the features: it needs at least parity with the current implementation.
-
Thanks @scholzj. There was also an idea about implementing this as a state machine. @tombentley, did you imagine a state machine that brings a single pod to a healthy state, or a state machine for the whole cluster? A whole-cluster state machine would observe the entire cluster state and determine the next action to progress it (or bail out if something is unhealthy).
-
Another question about the Roller. It currently does things like detecting "stuck pods", or pods it can't make an admin connection to, and restarting them, with no awareness of log recovery. There is a proposal to expose to the operator that a broker is in recovery so that the Roller won't restart it. Should restarting stuck or non-responsive pods be a Roller responsibility, or could we keep improving the liveness/readiness probes so that they detect these stuck cases and the Roller can focus on awaiting readiness? I imagine there is some history of problems fixed by each behaviour of the Roller. Maybe it fixes some problems that we can't detect from the probes?
-
Some thoughts from my understanding, based on @tombentley's session and the discussions that followed:
@tombentley's proposed state machine (see below) is broker-level. I wanted to get the status from @ShubhamRwt about his WIP and see if we (@ShubhamRwt, myself, and anyone interested) can get a PoC, review it, and then explore enriching it with a topic and partition state machine, or maybe a cluster state machine (e.g. "in maintenance window").
-
There are several issues around the Kafka Roller that could warrant a re-implementation of the Roller.
General Questions:
Technical Discussion
Possible responsibilities that could be decomposed out:
Some kind of next-pod-selector or planner could be warranted, because this decision is currently spread across KafkaRoller. We sort the node list to act on unready pods first (selection in a roundabout way), then while processing a pod the roller can decide to defer acting if the pod is the controller or if restarting it would break availability. This might be easier to reason about if there were one place that says, "first we want to act on unready pods, then pods that won't break availability, and finally the controller". It could also be a good place to put the logic that says "after acting on pod X, pod X remained unhealthy, stop the roll".
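To make the planner idea concrete, here is a rough Java sketch of what "one place" for the ordering could look like. Everything in it is hypothetical: `RollPlanner`, `PodState`, and the boolean flags are made-up names, not existing KafkaRoller types, and it assumes the flags have already been computed elsewhere. It also makes one assumption the text above doesn't settle: pods whose restart would break availability are placed after the safe pods and before the controller, rather than being dropped from the plan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical planner sketch; not part of the existing operator code. */
final class RollPlanner {

    /** Hypothetical snapshot of what the roller knows about one pod. */
    record PodState(String name, boolean ready, boolean controller,
                    boolean restartBreaksAvailability) { }

    /** Lower rank is acted on first. */
    static int rank(PodState pod) {
        if (!pod.ready()) {
            return 0; // unready pods first, even if one of them is the controller
        }
        if (pod.controller()) {
            return 3; // the (ready) controller goes last
        }
        if (pod.restartBreaksAvailability()) {
            return 2; // assumption: deferred until after the safe pods
        }
        return 1;     // ready pods that can be restarted safely
    }

    /** Returns pod names in the order the roller would act on them. */
    static List<String> plan(List<PodState> pods) {
        List<PodState> ordered = new ArrayList<>(pods);
        // List.sort is stable, so pods with equal rank keep their input order.
        ordered.sort(Comparator.comparingInt(RollPlanner::rank));
        List<String> names = new ArrayList<>();
        for (PodState pod : ordered) {
            names.add(pod.name());
        }
        return names;
    }

    /** "After acting on pod X, pod X remained unhealthy, stop the roll." */
    static boolean shouldStopRoll(PodState stateAfterAction) {
        return !stateAfterAction.ready();
    }
}
```

The point is only that the priority order lives in a single `rank` function that can be read and unit-tested on its own; a real planner would probably need to re-plan after each action, since readiness and availability change as pods restart.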
First Steps
The above questions might not be answerable until we've had a first shot at an implementation.
We could start by spiking out an implementation that does a subset of the work, like "restarts pods that need restarting, rolling the controller last" so we can put up some rough code for criticism.