Pyrra's UI isn't showing any data despite metrics being available #1397

Open · mtthwcmpbll opened this issue Jan 29, 2025 · 0 comments
What's not working:

I've deployed Pyrra on top of our observability stack, and from the operator's perspective it seems to be working as expected: PrometheusRules are generated for my ServiceLevelObjectives, and I can see the resulting metrics when I query them in Grafana. For some reason, though, the Pyrra UI shows no data on the SLO-specific detail pages.

What I've found so far:

I've created an SLO using the example from the Pyrra repo:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: pyrra-connect-errors
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99'
  window: 2w
  description: Pyrra serves API requests with connect-go either via gRPC or HTTP.
  indicator:
    ratio:
      errors:
        metric: connect_server_requests_total{job="pyrra",code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss"}
      total:
        metric: connect_server_requests_total{job="pyrra"}
      grouping:
        - service
        - method

It generates the following PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    prometheus-operator-validated: "true"
  creationTimestamp: "2025-01-21T21:07:34Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: pyrra-connect-errors
  namespace: monitoring
  ownerReferences:
  - apiVersion: pyrra.dev/v1alpha1
    controller: true
    kind: ServiceLevelObjective
    name: pyrra-connect-errors
    uid: 5b16534a-04c3-43f8-a24d-97038c9d2474
  resourceVersion: "252446336"
  uid: 3131bfb1-6a6e-45d4-8104-b79f2901faff
spec:
  groups:
  - interval: 1m30s
    name: pyrra-connect-errors-increase
    rules:
    - expr: sum by (code, method, service) (increase(connect_server_requests_total{job="pyrra"}[2w]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:increase2w
    - alert: SLOMetricAbsent
      expr: absent(connect_server_requests_total{job="pyrra"}) == 1
      for: 5m
      labels:
        job: pyrra
        severity: critical
        slo: pyrra-connect-errors
  - interval: 30s
    name: pyrra-connect-errors
    rules:
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[3m]))
        / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[3m]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate3m
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[15m]))
        / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[15m]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate15m
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[30m]))
        / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[30m]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate30m
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[1h]))
        / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[1h]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate1h
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[3h]))
        / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[3h]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate3h
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[12h]))
        / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[12h]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate12h
    - expr: sum by (method, service) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[2d]))
        / sum by (method, service) (rate(connect_server_requests_total{job="pyrra"}[2d]))
      labels:
        job: pyrra
        slo: pyrra-connect-errors
      record: connect_server_requests:burnrate2d
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate3m{job="pyrra",slo="pyrra-connect-errors"}
        > (14 * (1-0.99)) and connect_server_requests:burnrate30m{job="pyrra",slo="pyrra-connect-errors"}
        > (14 * (1-0.99))
      for: 1m0s
      labels:
        exhaustion: 1d
        job: pyrra
        long: 30m
        severity: critical
        short: 3m
        slo: pyrra-connect-errors
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate15m{job="pyrra",slo="pyrra-connect-errors"}
        > (7 * (1-0.99)) and connect_server_requests:burnrate3h{job="pyrra",slo="pyrra-connect-errors"}
        > (7 * (1-0.99))
      for: 8m0s
      labels:
        exhaustion: 2d
        job: pyrra
        long: 3h
        severity: critical
        short: 15m
        slo: pyrra-connect-errors
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate1h{job="pyrra",slo="pyrra-connect-errors"}
        > (2 * (1-0.99)) and connect_server_requests:burnrate12h{job="pyrra",slo="pyrra-connect-errors"}
        > (2 * (1-0.99))
      for: 30m0s
      labels:
        exhaustion: 1w
        job: pyrra
        long: 12h
        severity: warning
        short: 1h
        slo: pyrra-connect-errors
    - alert: ErrorBudgetBurn
      expr: connect_server_requests:burnrate3h{job="pyrra",slo="pyrra-connect-errors"}
        > (1 * (1-0.99)) and connect_server_requests:burnrate2d{job="pyrra",slo="pyrra-connect-errors"}
        > (1 * (1-0.99))
      for: 1h30m0s
      labels:
        exhaustion: 2w
        job: pyrra
        long: 2d
        severity: warning
        short: 3h
        slo: pyrra-connect-errors

I can verify that the recording rules are created and contain data by querying them in Grafana:

[Screenshot: querying the recording rules in Grafana returns data]
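
For example, querying one of the recorded series (matchers copied from the generated PrometheusRule above) returns values per (code, method, service):

# Recorded 2w increase, grouped by code, method, service as defined in the rule above.
connect_server_requests:increase2w{job="pyrra", slo="pyrra-connect-errors"}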

When I open the Pyrra UI's main page that lists the SLOs, the example SLO is the only entry in the list, but it reports no data.

[Screenshot: Pyrra SLO list page showing the objective with no data]

If I click into the objective, I see incorrect or missing data:

[Screenshots: Pyrra objective detail page showing missing or incorrect data]

If I look at the logs for the pyrra-api pod, I can see it making the following queries, first for the main page:

ALERTS{slo=~".+"}

sum by (service, method) (connect_server_requests:increase2w{job="pyrra",slo="pyrra-connect-errors"})

sum by (service, method) (connect_server_requests:increase2w{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra",slo="pyrra-connect-errors"})

I don't see any data for the ALERTS query, but the other two return data just fine if I query them myself through Grafana.
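As far as I understand, ALERTS is a synthetic series produced by whatever component evaluates the alerting rules (a Prometheus server or the Mimir ruler), not something scraped from a target, and it only exists while an alert is pending or firing. A minimal sanity check against the datastore, assuming the generated rules are loaded somewhere:

# ALERTS only exists while an alerting rule is pending or firing.
count by (alertname, alertstate) (ALERTS)

# ALERTS_FOR_STATE is written for active alerts that have a `for:` duration.
count by (alertname) (ALERTS_FOR_STATE)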

For the objective-specific page, here are the queries logged:

((1 - 0.99) - (sum(connect_server_requests:increase2w{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra",slo="pyrra-connect-errors"} or vector(0)) / sum(connect_server_requests:increase2w{job="pyrra",slo="pyrra-connect-errors"}))) / (1 - 0.99)

sum by (code) (rate(connect_server_requests_total{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra"}[5m])) / scalar(sum(rate(connect_server_requests_total{job="pyrra"}[5m]))) > 0

sum by (code) (rate(connect_server_requests_total{job="pyrra"}[5m])) > 0

ALERTS{slo="pyrra-connect-errors"}

ALERTS{slo=~".+"}

sum by (service, method) (connect_server_requests:increase2w{job="pyrra",slo="pyrra-connect-errors"})

sum by (service, method) (connect_server_requests:increase2w{code=~"aborted|unavailable|internal|unknown|unimplemented|dataloss",job="pyrra",slo="pyrra-connect-errors"})

connect_server_requests:burnrate30m{job="pyrra",slo="pyrra-connect-errors"}

connect_server_requests:burnrate1h{job="pyrra",slo="pyrra-connect-errors"}

connect_server_requests:burnrate15m{job="pyrra",slo="pyrra-connect-errors"}

connect_server_requests:burnrate12h{job="pyrra",slo="pyrra-connect-errors"}

connect_server_requests:burnrate3h{job="pyrra",slo="pyrra-connect-errors"}

connect_server_requests:burnrate2d{job="pyrra",slo="pyrra-connect-errors"}

connect_server_requests:burnrate3m{job="pyrra",slo="pyrra-connect-errors"}

If I run these queries in Grafana, I see data for most or all of them; I'm just not seeing any data in the graphs on the Pyrra UI's objective page. Other things that seem strange in the UI:

  • the availability panel shows "Errors 0, Total 1"; I'd expect the total to reflect connect_server_requests_total{job="pyrra"} as defined in the SLO, which returns 166 requests across different services/methods when I check it in Grafana
  • the multi burn rate list shows "NaN", which makes me think the UI isn't getting the data I expect back from the queries above (see the label check after this list)
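
A quick cross-check (a minimal sketch; the matchers and label names are taken from the generated rules above) is whether the recorded series in Mimir actually carry the labels the UI aggregates by:

# The UI's availability queries sum by (service, method); if the recorded
# series arrive in Mimir without those labels, the sums come out wrong.
count by (service, method, code) (connect_server_requests:increase2w{slo="pyrra-connect-errors"})

# Compare against the raw series the SLO is defined on.
count by (service, method, code) (connect_server_requests_total{job="pyrra"})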

For context, our setup:

  • We have a central Mimir deployment that replaces Prometheus
  • We have OpenTelemetry Collectors watching for Prometheus CRDs such as PodMonitors, ServiceMonitors, and PrometheusRules. These collectors are responsible for configuring and collecting everything, and we expect those Prometheus CRDs to be the common interface for services that expose metrics (no service or collector talks to Mimir natively)
  • We query and visualize our metrics in a Grafana instance backed by Mimir.
graph TD;
	otel-collector --> Mimir;
	PodMonitor --> otel-collector;
	ServiceMonitor --> otel-collector;
	PrometheusRule --> otel-collector;
	Pyrra --> ServiceMonitor;
	Pyrra --> ServiceLevelObjective;
	ServiceLevelObjective --> PrometheusRule;
	Mimir --> Grafana;
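
Given that topology, here is the sort of per-hop check I can run from Grafana (a sketch; the matchers are taken from the SLO above) to see where the pipeline stops producing data:

# 1. Raw scraped series reach Mimir via the collectors.
count(connect_server_requests_total{job="pyrra"})

# 2. Recording rules from the generated PrometheusRule are evaluated and stored.
count(connect_server_requests:increase2w{slo="pyrra-connect-errors"})
count(connect_server_requests:burnrate3m{slo="pyrra-connect-errors"})

# 3. Alerting rules from the same PrometheusRule are evaluated somewhere
#    (ALERTS only exists while an alert is pending or firing).
count(ALERTS{slo="pyrra-connect-errors"})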