Building Availability SLO for Kafka Cluster Utilizing Strimzi-Canary Metrics #219

OuesFa · 2023-06-22T08:38:09Z

I am building a Service Level Objective (SLO) for our Kafka cluster's availability using Strimzi-Canary metrics. The aim is to have two distinct resources for the SLO: one to monitor production and the other to monitor consumption.

For the Production SLI (Service Level Indicator), the plan is to employ strimzi_canary_records_produced as the reference for total events and strimzi_canary_records_produced_failed for unsuccessful events.
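To make the ratio concrete, here is a minimal sketch of how such a production SLI could be computed from two counter values taken over the same window. The function names `availability_sli` and `error_budget_remaining` are hypothetical, and the sketch assumes that the total counter includes failed attempts; that should be checked against how the canary actually increments its counters.

```python
def availability_sli(total_events: float, failed_events: float) -> float:
    """Return the fraction of good events in [0, 1].

    Assumption: total_events already includes the failed ones, e.g. the
    increase of strimzi_canary_records_produced over the SLO window, with
    failed_events being the increase of strimzi_canary_records_produced_failed.
    """
    if total_events <= 0:
        # No traffic in the window: treated as "no observed failure" (an assumption).
        return 1.0
    good_events = max(total_events - failed_events, 0.0)
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Return the share of the error budget still unspent (1.0 = untouched)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        # A 100% target leaves no budget to spend.
        return 0.0
    return 1.0 - (1.0 - sli) / budget


if __name__ == "__main__":
    sli = availability_sli(total_events=1000, failed_events=5)
    print(f"production SLI: {sli:.4f}")
    print(f"budget left at a 99% target: {error_budget_remaining(sli, 0.99):.2f}")
```

In practice the same ratio would normally be expressed as a query over the counters' increase in the SLO window rather than computed by hand; the Python version only illustrates the arithmetic.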

However, for the Consumption SLI there doesn't seem to be a direct equivalent of the 'failed' metric that exists for production. The closest metric I can find is consumer_error_total.
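One possible reading of that idea, purely as a sketch: treat each consumer error as one bad event and each consumed record as one good event. This is an assumption, not something the canary documents; consumer_error_total may count poll-level errors rather than per-record failures, and the name `consumption_sli` is hypothetical.

```python
def consumption_sli(records_consumed: float, consumer_errors: float) -> float:
    """Return good / (good + bad) for the consumer side.

    Assumptions (not confirmed by the canary): records_consumed counts only
    successfully consumed records, and consumer_errors (the increase of
    consumer_error_total over the same window) is disjoint from it.
    """
    attempts = records_consumed + consumer_errors
    if attempts <= 0:
        # Nothing observed in the window: treated as no failure (an assumption).
        return 1.0
    return records_consumed / attempts


if __name__ == "__main__":
    print(f"consumption SLI: {consumption_sli(990, 10):.3f}")
```

If errors and records turn out not to be comparable units, this ratio would under- or over-state availability, which is exactly the concern raised above.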

Would love to hear your thoughts on this approach and any suggestions on how I could effectively establish my Consumption SLO. Is there a more suitable method or metrics that I should consider?
