
simplified pod resource metrics (support for initContainers) #71

Open
rptaylor opened this issue Sep 10, 2024 · 9 comments

@rptaylor
Owner

Currently I believe initContainers will not be accounted for or seen, since there are separate metrics like kube_pod_init_container_resource_requests for them: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md

It would be a hassle to write code to add up the initContainers along with the regular containers and take the max of them, to try to duplicate/emulate the logic that k8s follows to determine the final pod resource amounts. It would be much better to simply query the actual resource amount of the whole pod, which is also recommended under the kube_pod_container_resource_requests description: "It is recommended to use the kube_pod_resource_requests metric exposed by kube-scheduler instead, as it is more precise."

Info about the kube-scheduler metrics: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#kube-scheduler-metrics
I checked on one of our clusters and kube_pod_resource_requests was not available. I think those metrics may need to be enabled with the --show-hidden-metrics-for-version flag, or maybe Prometheus needs to be configured to scrape the scheduler metrics. That would be an extra complication for deployment, but it's probably worth it to keep the code simpler and less bug-prone, especially if support for accounting initContainers is needed.
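
A quick way to check (assuming the Prometheus UI or API is reachable) is to query the metric name directly; something like the following should come back empty unless the scheduler's /metrics/resources endpoint is actually being scraped:

    # Returns no result unless kube-scheduler's /metrics/resources endpoint is scraped
    count(kube_pod_resource_request)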

@rptaylor
Owner Author

rptaylor commented Sep 10, 2024

That being said https://kubernetes.io/docs/reference/instrumentation/metrics/ says that kube_pod_resource_request is a STABLE metric (even going back to v1.27) ...

Update: but this says, referring to kube_pod_resource_request,

The metrics are exposed at the HTTP endpoint /metrics/resources and require the same authorization as the /metrics endpoint on the scheduler. You must use the --show-hidden-metrics-for-version=1.20 flag to expose these alpha stability metrics.

@mwestphall
Collaborator

mwestphall commented Sep 12, 2024

@rptaylor I've also confirmed that kube_pod_resource_request isn't present on CHTC's v1.27 k8s cluster. To approximate it, do you think it would make sense to take the max of the union of the per-pod sums of the init-container and container resource requests?

max by (pod) (
    sum by (pod) (kube_pod_container_resource_requests{namespace="<ns>",resource="cpu"}) 
    OR 
    sum by (pod) (kube_pod_init_container_resource_requests{namespace="<ns>",resource="cpu"})
)
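
One caveat with the `or` here (a sketch, not a tested query): after the `sum by (pod)`, both sides carry only the `pod` label, so `or` keeps just the left-hand series for any pod that has both regular and init containers, and the outer `max` never gets to compare the two. Tagging each side with a distinguishing label first should work around that. Also, since init containers run sequentially, Kubernetes compares against the max over init containers rather than their sum, so something along these lines may be closer to the scheduler's calculation:

max by (pod) (
    label_replace(
        sum by (pod) (kube_pod_container_resource_requests{namespace="<ns>",resource="cpu"}),
        "source", "containers", "", ""
    )
    or
    label_replace(
        max by (pod) (kube_pod_init_container_resource_requests{namespace="<ns>",resource="cpu"}),
        "source", "init", "", ""
    )
)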

@rptaylor
Owner Author

Maybe, but I'd like to at least investigate the alternative (kube-scheduler metrics) first to see how feasible it is, especially since the KSM developers specifically recommend that approach and it would make our lives easier too.
I suspect the metrics are there but we just need to do something to expose them and get them into Prometheus.

https://yuki-nakamura.com/2023/10/21/get-kube-schedulers-metrics-manually/
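
Roughly what the article walks through, as a sketch (the service account name is made up; the scheduler's secure port is 10259 by default):

    # Token for a service account that is RBAC-authorized for the /metrics
    # and /metrics/resources nonResourceURLs (name is illustrative)
    TOKEN=$(kubectl -n kube-system create token metrics-reader)
    # Run on (or tunnel to) the control-plane node while --bind-address is still 127.0.0.1
    curl -ks -H "Authorization: Bearer ${TOKEN}" \
        https://127.0.0.1:10259/metrics/resources | grep kube_pod_resource_request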

@mwestphall
Collaborator

After some investigation, I've discovered that kube_pod_resource_request exists and can be integrated with Prometheus, though it is a bit tricky to get access to. Assuming the user's Prometheus instance is installed via the kube-prometheus-stack helm chart, the steps to integrate these metrics are roughly as follows:

  • Per the [get kube-scheduler's metrics manually] article above, ensure that the kube-scheduler's --bind-address is updated from 127.0.0.1 to 0.0.0.0 in the control-plane static pod manifest to allow incoming traffic from Prometheus.

    sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' /etc/kubernetes/manifests/kube-scheduler.yaml
    
  • The kube_pod_resource_request metric is exposed at the /metrics/resources route on the scheduler rather than /metrics. The kube-prometheus-stack helm chart only scrapes /metrics on the scheduler by default, so a new scrape config must be added in the chart's values.yaml:

    prometheus:
      ...
      prometheusSpec:
        ...
        additionalScrapeConfigs:
        - job_name: serviceMonitor/default/prometheus-stack-kube-prom-kube-scheduler/1
          honor_timestamps: true
          track_timestamps_staleness: false
          scrape_interval: 30s
          scrape_timeout: 10s
          scrape_protocols:
          - OpenMetricsText1.0.0
          - OpenMetricsText0.0.1
          - PrometheusText0.0.4
          metrics_path: /metrics/resources
          scheme: https
          enable_compression: true
          authorization:
            type: Bearer
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          follow_redirects: true
          enable_http2: true
          relabel_configs:
          - ...
          kubernetes_sd_configs:
          - role: endpoints
            kubeconfig_file: ""
            follow_redirects: true
            enable_http2: true
            namespaces:
              own_namespace: false
              names:
              - kube-system
    
  • A new ClusterRole and ClusterRoleBinding must also be created, as the role created by the helm chart only grants Prometheus access to the /metrics endpoint by default:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prom-metrics-resources-role
    rules:
      - nonResourceURLs:
          - "/metrics/resources"
        verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prom-metrics-resources-role-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prom-metrics-resources-role
    subjects:
    - kind: ServiceAccount
      name: prometheus-stack-kube-prom-prometheus
      namespace: default
    

@rptaylor
Owner Author

@mwestphall That's great to hear, excellent investigation! I meant to reply earlier, but I got stuck trying to get the kube-scheduler metrics working in our cluster. I think we are planning to move towards kube-prometheus-stack instead of the Bitnami chart, so hopefully we will be able to converge and build on your findings then. (We are also investigating the possibility of VictoriaMetrics, but that could involve more compatibility challenges.)

@mwestphall
Collaborator

@rptaylor I'm about to head out for the holidays; let's plan on continuing this work at the start of the new year.

@mwestphall
Collaborator

@rptaylor I'm back in the office now. In terms of next steps here, would it make sense to set up a test cluster that exposes kube_pod_resource_request on the UW side and update the chart code to query that? We'd be a bit hesitant to update our main cluster given the need to edit the scheduler config but should be able to get a smaller dedicated cluster going.

@rptaylor
Owner Author

rptaylor commented Jan 10, 2025

@mwestphall happy new year and sorry for the delayed response.
Yes, I think so. My reservation was mainly that getting the scheduler metrics working could be an additional difficulty in some environments (like, evidently, our current clusters), presenting a barrier to users adopting kuantifier, and that we wouldn't be able to test or use the new method in our current clusters for now. However:

  • We decided to move from Bitnami's kube-prometheus to kube-prometheus-stack
  • We should add a config flag like "resource metric type" with a description along the lines of "whether to use container-based or pod-based resource metrics", and document that the pod-based method enables accounting of initContainers but requires exposing the scheduler metrics (a rough sketch follows this list). The default should probably be container-based to preserve current behaviour, but ideally pod-based would become the preferred/recommended way in environments where scheduler metrics are available. The PromQL queries are just (long, complex) strings, so we can use different queries in different modes based on the config flag. Hopefully it would not be too complex to maintain both querying methods.
  • We (possibly me, if I get a chance) could perhaps contribute an MR to the kube-prometheus-stack helm chart that adds an option to facilitate gathering the scheduler metrics, based on the config requirements you worked out.
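
For the second point, a rough sketch of what the flag could look like in kuantifier's values (key names are purely illustrative, not existing chart values):

    # Hypothetical values.yaml snippet -- key names are illustrative only
    resourceMetrics:
      # "container": current behaviour, sums kube_pod_container_resource_requests
      #              (initContainers not accounted for)
      # "pod":       use kube_pod_resource_request from kube-scheduler instead;
      #              requires the scheduler's /metrics/resources to be scraped
      type: container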

What do you think, does that make sense?

In the end it would be similar to what you suggested here: #71 (comment) but the result would be more controllable/configurable.

But given the logic involved, and in particular that the effective resource requests can differ from the declared resource requests, ultimately using the pod-based metrics from the scheduler is the only way to be correct in more complex situations such as initContainers.

@mwestphall
Collaborator

@rptaylor that sounds like a good plan, I will get started on that at the beginning of next week. With regards to the reservations about getting scheduler metrics working, configuration on the Prometheus side should be easy enough as long as we document it properly. I think there are probably still a couple of concerns with needing to edit the scheduler bind address to make the relevant metrics endpoint accessible in the first place, since to my understanding this is disabled by default in most Kubernetes setups.
