Kafka Roller Refactoring #7625

robobario · 2022-11-14T22:36:52Z

robobario
Nov 14, 2022
Collaborator

Several issues with the KafkaRoller could warrant re-implementing it:

Handling brokers in recovery after unclean shutdown: #5263
During KafkaAvailability checks, all topic descriptions are loaded into memory.
Observations about the pod are spread across the KafkaRoller code: controller state, availability, restart reasons, stuck-state detection.
Generally the class is doing a lot of work and decomposing it could help us reason about it.
We may want to add further complexity to handle rolling KRaft brokers once the nodes can be some combination of broker and controller.

General Questions:

Refactor or rewrite: could we feasibly refactor it to a better state step by step? Are there any responsibilities we could refactor out? For example, we could potentially extract the 'observation' logic to gather up all the facts about the pod.
If we decide to rewrite, would it warrant a feature gate so we could iterate on this without a full implementation up front?

Technical Discussion

Possible responsibilities that could be decomposed out:

fact gatherer: gather all the facts about the pod, is it stuck, does it need restart, is it a controller, does restart break availability etc.
next-node-selection: select the next node to restart (instead of nodes deferring their own restart)
planner: an alternative to node-by-node selection might be a planner. This could be something slightly heavier where we could plug in ideas like "I can safely restart nodes X and Y in parallel".
pod reconciliation: something could be responsible for moving a pod from its current state to 'healthy' or some other terminal state (without being concerned with the other nodes in the cluster).
cluster roller: orchestrates the whole roll: gather observations, decide if it should roll the cluster, then execute the roll as far as possible.
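
To make the planner idea concrete, here is a minimal sketch of how "I can safely restart nodes X and Y in parallel" might be checked. It is not Strimzi code: `PartitionFacts`, `can_restart_together` and `plan_batches` are hypothetical names, and the availability rule (every partition keeps at least `min.insync.replicas` in-sync replicas while the batch is down) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class PartitionFacts:
    """Hypothetical per-partition facts a fact gatherer might supply."""
    name: str
    in_sync_replicas: FrozenSet[int]  # ids of nodes currently in sync
    min_in_sync: int                  # the topic's min.insync.replicas


def can_restart_together(nodes: Iterable[int],
                         partitions: Iterable[PartitionFacts]) -> bool:
    """True if every partition keeps at least min_in_sync in-sync replicas
    while all of `nodes` are down at once (assumed availability rule)."""
    down = frozenset(nodes)
    return all(len(p.in_sync_replicas - down) >= p.min_in_sync
               for p in partitions)


def plan_batches(nodes: List[int],
                 partitions: List[PartitionFacts]) -> Tuple[List[List[int]], List[int]]:
    """Greedily group nodes into batches that are safe to restart in parallel.

    Returns (batches, blocked), where `blocked` holds nodes that cannot be
    restarted safely even on their own."""
    batches: List[List[int]] = []
    blocked: List[int] = []
    for node in nodes:
        if not can_restart_together([node], partitions):
            blocked.append(node)
            continue
        for batch in batches:
            # Join the first existing batch that stays safe with this node.
            if can_restart_together(batch + [node], partitions):
                batch.append(node)
                break
        else:
            batches.append([node])
    return batches, blocked
```

The plan is only as good as the facts it was built from, so a real planner would presumably re-gather in-sync replica state after each batch rather than trust the whole plan up front.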

Some kind of next-pod-selector or planner could be warranted because the decision is currently distributed across KafkaRoller. We sort the node list to act on unready pods first (selection in a roundabout way), then while processing a pod the roller can decide to defer acting if the pod is the controller or if acting would break availability. This might be easier to reason about if there is one place that says, "first we want to act on unready pods, then pods that won't break availability, and finally the controller". It could also be a good place to put the logic that says "after acting on pod X, pod X remained unhealthy, stop the roll".
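
As a rough illustration of putting that ordering in one place, the sketch below picks the next node from a list of gathered facts. Everything in it is an assumption made for illustration: `NodeFacts`, `select_next` and `RollStopped` are hypothetical names, and the rule that a ready pod whose restart would break availability is simply deferred is one possible reading of the current behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeFacts:
    """Hypothetical output of a fact gatherer for one node."""
    node_id: int
    ready: bool
    needs_restart: bool
    is_controller: bool
    restart_breaks_availability: bool


class RollStopped(Exception):
    """Raised when the node acted on last did not become ready."""


def _rank(node: NodeFacts) -> Optional[int]:
    """Lower rank means act sooner; None means do not act right now."""
    if not node.ready:
        return 0          # unready pods first
    if not node.needs_restart:
        return None       # nothing to do
    if node.restart_breaks_availability:
        return None       # defer until restarting is safe
    return 2 if node.is_controller else 1  # controller last


def select_next(nodes: List[NodeFacts],
                last_acted: Optional[int] = None) -> Optional[NodeFacts]:
    """Return the next node to act on, or None if nothing is eligible."""
    by_id = {n.node_id: n for n in nodes}
    if last_acted is not None and not by_id[last_acted].ready:
        raise RollStopped(f"node {last_acted} remained unhealthy after acting on it")
    eligible = [n for n in nodes if _rank(n) is not None]
    return min(eligible, key=lambda n: (_rank(n), n.node_id), default=None)
```

Keeping the ranking in a single function means the "unready, then safe, then controller" policy can be read and tested without running a roll.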

First Steps

The above questions might not be answerable until we've had a first shot at an implementation.

We could start by spiking out an implementation that does a subset of the work, like "restarts pods that need restarting, rolling the controller last" so we can put up some rough code for criticism.

scholzj · 2022-11-14T23:55:39Z

scholzj
Nov 14, 2022
Maintainer

I think doing a complete re-write might be tricky. It is connected to a lot of different parts of the operator, so there might be many conflicts. You also cannot roll it out with just some features. It needs to have at least parity with the current implementation.

robobario · 2022-11-15T01:01:04Z

robobario
Nov 15, 2022
Collaborator Author

Thanks @scholzj.

Also, @tombentley, there was an idea about implementing this as a state machine.

Did you imagine a state machine to bring a single pod to a healthy state, or a state machine for the whole cluster? A whole-cluster state machine would observe the entire cluster state and determine the next action to progress it (or bail out if something is unhealthy).

1 reply

tombentley Nov 15, 2022
Maintainer

Did you imagine a state machine to bring a single pod to a healthy state, or a state machine for the whole cluster?

I was thinking about having a state machine per broker. But we should be KRaft-aware and think in terms of servers (each of which could be a broker, a controller or both). You can see a very hand-wavy bit of code here. The basic idea there is to separate out:

the collection of observations about all the brokers,
the classification of those observations into one of a small number of states,
picking the server to act upon based on those states (i.e. pick the most dysfunctional broker first and restart/reconfigure that; this avoids a bad configuration being rolled out to healthy brokers and taking down the entire cluster over a number of reconciliations).

The main benefit I see from using an explicit state machine is simply that it makes it very easy to understand (for example from application logs) what's happening or has happened, provided you log the state of all brokers whenever the roller takes an action. In a way it's not really a state machine, because you don't prohibit any transitions between states: we shouldn't build in assumptions about what is possible or impossible, but we do build in the logic for "if it's in state X then we do action Y".
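
A minimal sketch of that separation (observe, classify, then look up an action) might look like the following. The state names, the `Observation` fields and the state-to-action table are all invented for illustration and are not taken from the hand-wavy code mentioned above. No transitions are prohibited: each pass simply reclassifies every server from fresh observations.

```python
import enum
import logging
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

log = logging.getLogger("roller-sketch")


class State(enum.IntEnum):
    """Hypothetical states, ordered most dysfunctional first (lower is worse)."""
    NOT_RUNNING = 0   # unschedulable or crash-looping
    UNRESPONSIVE = 1  # running but not answering client requests
    RECOVERING = 2    # replaying logs after an unclean shutdown
    STALE = 3         # healthy but running an outdated configuration
    HEALTHY = 4


class Action(enum.Enum):
    RESTART = "restart"
    WAIT = "wait"
    NONE = "none"


# "If it's in state X then we do action Y" (assumed mapping).
ACTIONS = {
    State.NOT_RUNNING: Action.RESTART,
    State.UNRESPONSIVE: Action.RESTART,
    State.RECOVERING: Action.WAIT,
    State.STALE: Action.RESTART,
    State.HEALTHY: Action.NONE,
}


@dataclass(frozen=True)
class Observation:
    running: bool
    responds_to_client: bool
    in_log_recovery: bool
    up_to_date: bool


def classify(obs: Observation) -> State:
    """Map raw observations about one server onto exactly one state."""
    if not obs.running:
        return State.NOT_RUNNING
    if obs.in_log_recovery:
        # Checked before responsiveness: a recovering broker may not answer yet.
        return State.RECOVERING
    if not obs.responds_to_client:
        return State.UNRESPONSIVE
    if not obs.up_to_date:
        return State.STALE
    return State.HEALTHY


def next_step(observations: Dict[int, Observation]) -> Tuple[Optional[int], Action]:
    """Classify every server, log all states, and act on the worst one."""
    states = {node: classify(obs) for node, obs in observations.items()}
    log.info("server states: %s", {n: s.name for n, s in sorted(states.items())})
    if not states:
        return None, Action.NONE
    node, state = min(states.items(), key=lambda kv: (kv[1], kv[0]))
    action = ACTIONS[state]
    return (node if action is not Action.NONE else None), action
```

Because the worst server is always handled first, a recovering broker in this sketch holds up restarts of merely stale ones, which is the conservative behaviour described above.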

robobario · 2022-11-15T01:59:14Z

robobario
Nov 15, 2022
Collaborator Author

Another question about the Roller. It currently detects "stuck pods", or pods that it can't make an admin connection to, and restarts them with no awareness of log recovery. There is a proposal to expose to the operator that a broker is in recovery, so that the Roller won't restart it. Should restarting stuck/non-responsive pods be a Roller responsibility, or could we keep improving the liveness/readiness probes so that they detect these stuck cases and the Roller can focus on awaiting readiness?

I imagine there is some history of problems fixed by each behaviour of the roller. Maybe it fixes some problems that we can't detect from the probes?

1 reply

tombentley Nov 15, 2022
Maintainer

IIRC the "stuck pod" check is trying to capture the cases where either:

1. Kube can't schedule a pod for any reason (e.g. too high a memory request)
2. The pod can be scheduled, but ends up crashlooping

Aside: I have wondered whether point 1 could also be made less common by killing and restarting the broker within its existing container, thus avoiding the need to involve the kube scheduler at all (since the pod wouldn't get deleted). However, this interacts poorly with kube health probes, and the benefits don't seem worth the additional complexity.

What we wanted to avoid was the roller picking healthy pods to delete over successive reconciliations and eventually taking down the cluster. But the definition of a "stuck pod" was always a bit vague; it was based on what had broken in the past.

With the state machine idea this is achieved by:

Classifying pods appropriately
Ordering pods-to-be-rolled by their states (most dysfunctional first)

The Admin thing is also a bit vague. The idea was to infer that the broker isn't unhealthy because it's responding to a client, so you know you have network reachability at least. If this fails, maybe deleting the pod will make things work (because who knows how reliable the networking stack that we're running on really is?). But there are possibly simpler ways of achieving this. For example, we don't really need to use a separate Admin client instance: it should be enough to use a describe broker configs request. This is also the same basic idea that the canary uses. It seems a bit wasteful to duplicate the logic for this kind of thing, and it creates the possibility that the canary reaches a different conclusion than the broker. So it would be nice for these to converge eventually (i.e. I don't see it as a short-term goal).

devguyio · 2022-11-15T09:57:26Z

devguyio
Nov 15, 2022
Collaborator

Some thoughts from my understanding, based on @tombentley's session and the discussions that followed:

Making a recovery decision at pod level (i.e. via readiness/liveness probes) is dangerous, since you need to avoid unnecessary broker restarts. Do you have scenarios in mind where a correct decision can be made at pod level, in isolation from the cluster?
My understanding of the direction we're heading towards is:
- Aggregate per-broker observations through k8s, ksense and the admin client.
- Infer/classify a per-broker state.
- Aggregate all broker states.
- Infer a per-broker rolling decision to proceed to the next state.
- Reconcile.

I think using a feature gate is an idea worth exploring. Part of it might be some needed "prefactoring", but it feels feasible and not too much overhead.

So a whole cluster state machine would observe the entire cluster state and determine the next action to progress it (or bail out if something is unhealthy).

@tombentley's proposed state machine (see below) is broker-level. I wanted to get the status from @ShubhamRwt about his WIP and see if we (@ShubhamRwt, myself and any interested folks) can get a PoC, review it, and then explore enriching it with a topic and partition state machine, or maybe a cluster state machine (e.g. in a maintenance window).
