
simplified pod resource metrics (support for initContainers) #71

Open
rptaylor opened this issue Sep 10, 2024 · 9 comments

@rptaylor
Owner

Currently I believe initContainers will not be accounted for or seen, since there are separate metrics like kube_pod_init_container_resource_requests for them: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md

It would be a hassle to write code to add up the initContainers along with the regular containers and take the max of them, to try to duplicate/emulate the logic that k8s follows to determine the final pod resource amounts. It would be much better to simply query the actual resource amount of the whole pod, which is also recommended under the kube_pod_container_resource_requests description: "It is recommended to use the kube_pod_resource_requests metric exposed by kube-scheduler instead, as it is more precise."

Info about the kube-scheduler metrics: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#kube-scheduler-metrics
I checked on one of our clusters and kube_pod_resource_requests was not available. I think those metrics may need to be enabled with the --show-hidden-metrics-for-version flag, or maybe Prometheus needs to be configured to scrape the scheduler metrics. That would be an extra complication for deployment, but it's probably worth it to keep the code simpler and less bug-prone, especially if support for accounting initContainers is needed.
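
A quick way to check (assuming the Prometheus UI or API is reachable) is to query the metric name directly; something like the following should come back empty unless the scheduler's /metrics/resources endpoint is actually being scraped:

    # Returns no result unless kube-scheduler's /metrics/resources endpoint is scraped
    count(kube_pod_resource_request)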

@rptaylor
Owner Author

rptaylor commented Sep 10, 2024

That being said https://kubernetes.io/docs/reference/instrumentation/metrics/ says that kube_pod_resource_request is a STABLE metric (even going back to v1.27) ...

Update: but this says, referring to kube_pod_resource_request,

The metrics are exposed at the HTTP endpoint /metrics/resources and require the same authorization as the /metrics endpoint on the scheduler. You must use the --show-hidden-metrics-for-version=1.20 flag to expose these alpha stability metrics.

@mwestphall
Collaborator

mwestphall commented Sep 12, 2024

@rptaylor I've also confirmed that kube_pod_resource_request isn't present on CHTC's v1.27 k8s cluster. To approximate it, do you think it would make sense to take the max of the union of the per-pod sums of the init-container and container resource requests?

max by (pod) (
    sum by (pod) (kube_pod_container_resource_requests{namespace="<ns>",resource="cpu"}) 
    OR 
    sum by (pod) (kube_pod_init_container_resource_requests{namespace="<ns>",resource="cpu"})
)
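
One caveat with the `or` here (a sketch, not a tested query): after the `sum by (pod)`, both sides carry only the `pod` label, so `or` keeps just the left-hand series for any pod that has both regular and init containers, and the outer `max` never gets to compare the two. Tagging each side with a distinguishing label first should work around that. Also, since init containers run sequentially, Kubernetes compares against the max over init containers rather than their sum, so something along these lines may be closer to the scheduler's calculation:

max by (pod) (
    label_replace(
        sum by (pod) (kube_pod_container_resource_requests{namespace="<ns>",resource="cpu"}),
        "source", "containers", "", ""
    )
    or
    label_replace(
        max by (pod) (kube_pod_init_container_resource_requests{namespace="<ns>",resource="cpu"}),
        "source", "init", "", ""
    )
)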

@rptaylor
Owner Author

Maybe, but I'd like to at least investigate the alternative (kube-scheduler metrics) first to see how feasible it is, especially since the KSM developers specifically recommend that approach and it would make our lives easier too.
I suspect the metrics are there but we just need to do something to expose them and get them into Prometheus.

https://yuki-nakamura.com/2023/10/21/get-kube-schedulers-metrics-manually/
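
Roughly what the article walks through, as a sketch (the service account name is made up; the scheduler's secure port is 10259 by default):

    # Token for a service account that is RBAC-authorized for the /metrics
    # and /metrics/resources nonResourceURLs (name is illustrative)
    TOKEN=$(kubectl -n kube-system create token metrics-reader)
    # Run on (or tunnel to) the control-plane node while --bind-address is still 127.0.0.1
    curl -ks -H "Authorization: Bearer ${TOKEN}" \
        https://127.0.0.1:10259/metrics/resources | grep kube_pod_resource_request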

@mwestphall
Collaborator

After some investigation, I've discovered that kube_pod_resource_request exists and can be integrated with Prometheus, though it is a bit tricky to get access to. Assuming the user's Prometheus instance is installed via the kube-prometheus-stack helm chart, the steps to integrate these metrics are roughly as follows:

  • Per the [get kube-scheduler's metrics manually] article above, ensure that the kube-scheduler's --bind-address is updated from 127.0.0.1 to 0.0.0.0 in the control-plane static pod manifest to allow incoming traffic from Prometheus.

    sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' /etc/kubernetes/manifests/kube-scheduler.yaml
    
  • The kube_pod_resource_request metric is exposed at the /metrics/resources route on the scheduler rather than /metrics. The kube-prometheus-stack helm chart only scrapes /metrics on the scheduler by default, so a new scrape config must be added in the chart's values.yaml:

    prometheus:
      ...
      prometheusSpec:
        ...
        additionalScrapeConfigs:
        - job_name: serviceMonitor/default/prometheus-stack-kube-prom-kube-scheduler/1
          honor_timestamps: true
          track_timestamps_staleness: false
          scrape_interval: 30s
          scrape_timeout: 10s
          scrape_protocols:
          - OpenMetricsText1.0.0
          - OpenMetricsText0.0.1
          - PrometheusText0.0.4
          metrics_path: /metrics/resources
          scheme: https
          enable_compression: true
          authorization:
            type: Bearer
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          follow_redirects: true
          enable_http2: true
          relabel_configs:
          - ...
          kubernetes_sd_configs:
          - role: endpoints
            kubeconfig_file: ""
            follow_redirects: true
            enable_http2: true
            namespaces:
              own_namespace: false
              names:
              - kube-system
    
  • A new ClusterRole and ClusterRoleBinding must also be created, as the role created by the helm chart only grants Prometheus access to the /metrics endpoint by default:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prom-metrics-resources-role
    rules:
      - nonResourceURLs:
          - "/metrics/resources"
        verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prom-metrics-resources-role-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prom-metrics-resources-role
    subjects:
    - kind: ServiceAccount
      name: prometheus-stack-kube-prom-prometheus
      namespace: default
    

@rptaylor
Owner Author

@mwestphall That's great to hear, excellent investigation! I meant to reply earlier, but I got stuck trying to get the kube-scheduler metrics working in our cluster. I think we are planning to move towards kube-prometheus-stack instead of the Bitnami chart, so hopefully we will be able to converge and build on your findings then. (We are also investigating the possibility of VictoriaMetrics, but that could involve more compatibility challenges.)

@mwestphall
Collaborator

@rptaylor I'm about to head out for the holidays; let's plan on continuing this work at the start of the new year.

@mwestphall
Collaborator

@rptaylor I'm back in the office now. In terms of next steps here, would it make sense to set up a test cluster that exposes kube_pod_resource_request on the UW side and update the chart code to query that? We'd be a bit hesitant to update our main cluster given the need to edit the scheduler config but should be able to get a smaller dedicated cluster going.

@rptaylor
Owner Author

rptaylor commented Jan 10, 2025

@mwestphall happy new year and sorry for the delayed response.
Yes, I think so. My reservation was mainly that getting the scheduler metrics working could be an additional difficulty in some environments (like, evidently, our current clusters), presenting a barrier to users adopting kuantifier, and that we wouldn't be able to test or use the new method in our current clusters for now. However:

  • We decided to move from Bitnami's kube-prometheus to kube-prometheus-stack
  • We should add a config flag like "resource metric type" with a description along the lines of "whether to use container-based or pod-based resource metrics", and document that the pod-based method enables accounting of initContainers but requires exposing the scheduler metrics (a rough sketch follows this list). The default should probably be container-based to preserve current behaviour, but ideally pod-based would become the preferred/recommended way in environments where scheduler metrics are available. The PromQL queries are just (long, complex) strings, so we can use different queries in different modes based on the config flag. Hopefully it would not be too complex to maintain both querying methods.
  • We (possibly me, if I get a chance) could perhaps contribute an MR to the kube-prometheus-stack helm chart that adds an option to facilitate gathering the scheduler metrics, based on the config requirements you worked out.
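
For the second point, a rough sketch of what the flag could look like in kuantifier's values (key names are purely illustrative, not existing chart values):

    # Hypothetical values.yaml snippet -- key names are illustrative only
    resourceMetrics:
      # "container": current behaviour, sums kube_pod_container_resource_requests
      #              (initContainers not accounted for)
      # "pod":       use kube_pod_resource_request from kube-scheduler instead;
      #              requires the scheduler's /metrics/resources to be scraped
      type: container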

What do you think, does that make sense?

In the end it would be similar to what you suggested here: #71 (comment) but the result would be more controllable/configurable.

But given the logic involved, and in particular that the effective resource requests can differ from the declared resource requests, ultimately using the pod-based metrics from the scheduler is the only way to be correct in more complex situations such as initContainers.

@mwestphall
Collaborator

@rptaylor that sounds like a good plan, I will get started on that at the beginning of next week. With regards to the reservations about getting scheduler metrics working, configuration on the Prometheus side should be easy enough as long as we document it properly. I think there are probably still a couple of concerns with needing to edit the scheduler bind address to make the relevant metrics endpoint accessible in the first place, since to my understanding this is disabled by default in most Kubernetes setups.
