I've deployed Pyrra on top of our observability stack and it seems to be working as expected from the operator perspective (I get PrometheusRules generated for my ServiceLevelObjectives, I see those metrics available when I query Grafana for them). For some reason, the Pyrra UI shows no data on the SLO-specific details pages.
What I've found so far
I've created an SLO using the example from the Pyrra repo:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: pyrra-connect-errors
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99'
  window: 2w
  description: Pyrra serves API requests with connect-go either via gRPC or HTTP.
  indicator:
    ratio:
      errors:
        metric: connect_server_requests_total{job="pyrra",code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss"}
      total:
        metric: connect_server_requests_total{job="pyrra"}
      grouping:
        - service
        - method
It generates the following PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    prometheus-operator-validated: "true"
  creationTimestamp: "2025-01-21T21:07:34Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: pyrra-connect-errors
  namespace: monitoring
  ownerReferences:
  - apiVersion: pyrra.dev/v1alpha1
    controller: true
    kind: ServiceLevelObjective
    name: pyrra-connect-errors
    uid: 5b16534a-04c3-43f8-a24d-97038c9d2474
  resourceVersion: "252446336"
  uid: 3131bfb1-6a6e-45d4-8104-b79f2901faff
spec:
  groups:
  - interval: 1m30s
    name: pyrra-connect-errors-increase
    rules:
    - expr: sum by (code, method, service) (increase(connect_server_requests_total{job="pyrra"}[2w]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:increase2w
    - alert: SLOMetricAbsent
      expr: absent(connect_server_requests_total{job="pyrra"}) == 1
      for: 5m
      labels:
        job: pyrra
        severity: critical
        slo: pyrra-connect-errors
  - interval: 30s
    name: pyrra-connect-errors
    rules:
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[3m])) / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[3m]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate3m
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[15m])) / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[15m]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate15m
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[30m])) / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[30m]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate30m
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[1h])) / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[1h]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate1h
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[3h])) / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[3h]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate3h
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[12h])) / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[12h]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate12h
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[2d])) / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[2d]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate2d
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate3m{job="pyrra",slo="pyrra-connect-errors"} > (14 * (1-0.99)) and connect_server_requests:burnrate30m{job="pyrra",slo="pyrra-connect-errors"} > (14 * (1-0.99))
      for: 1m0s
      labels:
        exhaustion: 1d
        job: pyrra
        long: 30m
        severity: critical
        short: 3m
        slo: pyrra-connect-errors
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate15m{job="pyrra",slo="pyrra-connect-errors"} > (7 * (1-0.99)) and connect_server_requests:burnrate3h{job="pyrra",slo="pyrra-connect-errors"} > (7 * (1-0.99))
      for: 8m0s
      labels:
        exhaustion: 2d
        job: pyrra
        long: 3h
        severity: critical
        short: 15m
        slo: pyrra-connect-errors
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate1h{job="pyrra",slo="pyrra-connect-errors"} > (2 * (1-0.99)) and connect_server_requests:burnrate12h{job="pyrra",slo="pyrra-connect-errors"} > (2 * (1-0.99))
      for: 30m0s
      labels:
        exhaustion: 1w
        job: pyrra
        long: 12h
        severity: warning
        short: 1h
        slo: pyrra-connect-errors
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate3h{job="pyrra",slo="pyrra-connect-errors"} > (1 * (1-0.99)) and connect_server_requests:burnrate2d{job="pyrra",slo="pyrra-connect-errors"} > (1 * (1-0.99))
      for: 1h30m0s
      labels:
        exhaustion: 2w
        job: pyrra
        long: 2d
        severity: warning
        short: 3h
        slo: pyrra-connect-errors
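As a sanity check on the generated ErrorBudgetBurn alerts, the thresholds and `exhaustion` labels can be recomputed from the SLO's target and window. This is a minimal sketch that assumes each threshold is `factor * (1 - target)` (which is what the expressions show) and that the `exhaustion` label equals `window / factor` (an inference from the values, not something taken from Pyrra's source). The function names are made up for illustration.

```python
from datetime import timedelta

TARGET = 0.99                # spec.target: '99'
WINDOW = timedelta(weeks=2)  # spec.window: 2w

# (burn-rate factor, short window, long window), read off the alerts above
ALERTS = [(14, "3m", "30m"), (7, "15m", "3h"), (2, "1h", "12h"), (1, "3h", "2d")]


def threshold(factor, target=TARGET):
    """Error ratio above which a burn-rate recording rule trips the alert."""
    return factor * (1 - target)


def exhaustion(factor, window=WINDOW):
    """Time to spend the whole error budget at a sustained burn rate (assumed)."""
    return window / factor


for factor, short, long_ in ALERTS:
    print(f"short={short:>3} long={long_:>3} "
          f"threshold={threshold(factor):.2f} exhaustion={exhaustion(factor)}")
```

Under these assumptions the four factors reproduce the `exhaustion` labels in the rule (1d, 2d, 1w, 2w), which suggests the operator side generated the rules consistently with the SLO definition.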
I can verify that the recording rules are created and contain data by querying them in Grafana:
When I access the Pyrra UI main page that lists the SLOs, I see the example SLO as the only thing in the list but it says there is no data.
If I click into the objective, I see incorrect or missing data:
If I look at the logs for the pyrra-api pod, I can see it making the following queries, first for the main page:
ALERTS{slo=~".+"}
sum by (service, method) (connect_server_requests:increase2w{job="pyrra",slo="pyrra-connect-errors"})
sum by (service, method) (connect_server_requests:increase2w{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra",slo="pyrra-connect-errors"})
I don't see any data for the ALERTS query, but the other two return data just fine if I query them myself through Grafana.
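To compare what Grafana sees with what a client of the Prometheus HTTP API would see, the same queries can be sent to the instant-query endpoint (`/api/v1/query`) directly. The sketch below only builds the requests and does not send them. The base URL and tenant ID are hypothetical placeholders. The `/prometheus` prefix and the `X-Scope-OrgID` header reflect Mimir's defaults for its Prometheus-compatible API and multi-tenancy, and may differ in a given deployment.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values; replace with the real Mimir endpoint and tenant.
MIMIR_BASE_URL = "http://mimir.example.internal/prometheus"
TENANT_ID = "demo"

QUERIES = [
    'ALERTS{slo=~".+"}',
    'sum by (service, method) (connect_server_requests:increase2w'
    '{job="pyrra",slo="pyrra-connect-errors"})',
]


def build_instant_query(base_url, promql, tenant=None):
    """Build (but do not send) a GET request for a Prometheus instant query."""
    url = f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
    headers = {"X-Scope-OrgID": tenant} if tenant else {}
    return Request(url, headers=headers)


for q in QUERIES:
    req = build_instant_query(MIMIR_BASE_URL, q, TENANT_ID)
    print(req.full_url)
    # To actually run the query, pass `req` to urllib.request.urlopen()
    # and inspect the "data.result" field of the JSON response.
```

If these requests return data while the Pyrra UI does not, the difference is more likely in how pyrra-api reaches the backend (URL prefix, tenant header) than in the recording rules themselves.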
For the objective-specific page, here are the queries logged:
((1 - 0.99) - (sum(connect_server_requests:increase2w{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra",slo="pyrra-connect-errors"} or vector(0)) / sum(connect_server_requests:increase2w{job="pyrra",slo="pyrra-connect-errors"}))) / (1 - 0.99)
sum by (code) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[5m])) / scalar(sum(rate(connect_server_requests_total{job="pyrra"}[5m]))) > 0
sum by (code) (rate(connect_server_requests_total{job="pyrra"}[5m])) > 0
ALERTS{slo="pyrra-connect-errors"}
ALERTS{slo=~".+"}
sum by (service, method) (connect_server_requests:increase2w{job="pyrra",slo="pyrra-connect-errors"})
sum by (service, method) (connect_server_requests:increase2w{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra",slo="pyrra-connect-errors"})
connect_server_requests:burnrate30m{job="pyrra",slo="pyrra-connect-errors"}
connect_server_requests:burnrate1h{job="pyrra",slo="pyrra-connect-errors"}
connect_server_requests:burnrate15m{job="pyrra",slo="pyrra-connect-errors"}
connect_server_requests:burnrate12h{job="pyrra",slo="pyrra-connect-errors"}
connect_server_requests:burnrate3h{job="pyrra",slo="pyrra-connect-errors"}
connect_server_requests:burnrate2d{job="pyrra",slo="pyrra-connect-errors"}
connect_server_requests:burnrate3m{job="pyrra",slo="pyrra-connect-errors"}
If I make these queries in Grafana, I see data for most or all of them. I'm just not seeing any data in the graphs in the Pyrra UI for the objective. Other things that seem strange in the UI:
It shows "Availability: Errors 0, Total 1". I'd expect the total to reflect the query connect_server_requests_total{job="pyrra"} defined in the SLO, which returns 166 requests across different services/methods when checked in Grafana.
The multirate burndown list shows "NaN", which makes me think the UI is not getting the data I expect from the queries above.
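For reference, the error-budget query logged for the objective page reduces to simple arithmetic on two numbers: the error count and the total count over the window. The sketch below mirrors that expression in Python. The function names are illustrative, and returning NaN for a zero total is an assumption made to mimic PromQL's `0 / 0` behaviour, not a statement about what Pyrra does internally.

```python
import math


def availability(errors, total):
    """Fraction of successful requests; NaN when there is no traffic."""
    if total == 0:
        return math.nan
    return 1 - errors / total


def error_budget_remaining(errors, total, target=0.99):
    """Mirror of ((1 - target) - errors / total) / (1 - target)."""
    if total == 0:
        return math.nan
    budget = 1 - target
    return (budget - errors / total) / budget


# Values the UI reports versus the total seen in Grafana.
for errors, total in [(0, 1), (0, 166), (0, 0)]:
    print(errors, total,
          availability(errors, total),
          error_budget_remaining(errors, total))
```

With zero errors, a total of 1 and a total of 166 both give a full budget, so the "Total 1" value alone would not explain empty graphs. A NaN only appears in this sketch when the denominator is zero, which is one possible (unconfirmed) reading of the "NaN" entries in the burn-rate list.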
For context, our setup:
We have a central Mimir deployment that replaces Prometheus.
We have OpenTelemetry Collectors watching for Prometheus CRDs such as PodMonitors, ServiceMonitors, and PrometheusRules. These collectors are responsible for configuring and collecting everything, and we expect the common interface for services that expose metrics to be those Prometheus CRDs (no service or collector talks directly to Mimir natively).
We query and visualize our metrics in a Grafana instance backed by Mimir.