-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial metrics #651
base: main
Are you sure you want to change the base?
Add initial metrics #651
Conversation
Add dependency: sigs.k8s.io/controller-runtime v0.14.6 go get sigs.k8s.io/[email protected] go mod tidy
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
pkg/controller/queuejob/metrics.go
Outdated
allocatableCapacityCpu := prometheus.NewGaugeFunc(prometheus.GaugeOpts{ | ||
Subsystem: "mcad", | ||
Name: "allocatable_capacity_cpu", | ||
Help: "Allocatable CPU Capacity (in milicores)", | ||
}, func() float64 { return controller.GetAllocatableCapacity().MilliCPU }) | ||
metrics.Registry.MustRegister(allocatableCapacityCpu) | ||
|
||
allocatableCapacityMemory := prometheus.NewGaugeFunc(prometheus.GaugeOpts{ | ||
Subsystem: "mcad", | ||
Name: "allocatable_capacity_memory", | ||
Help: "Allocatable Memory Capacity", | ||
}, func() float64 { return controller.GetAllocatableCapacity().Memory }) | ||
metrics.Registry.MustRegister(allocatableCapacityMemory) | ||
|
||
allocatableCapacityGpu := prometheus.NewGaugeFunc(prometheus.GaugeOpts{ | ||
Subsystem: "mcad", | ||
Name: "allocatable_capacity_gpu", | ||
Help: "Allocatable GPU Capacity", | ||
}, func() float64 { return float64(controller.GetAllocatableCapacity().GPU) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calls to GetAllocatableCapacity()
are quite expensive to be executed in GaugeFuncs. Should I call it periodically and cache the values like in the original PR?
https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/573/files#diff-cd034f141bebdabd3ed7e6118e9581ec4646707beaf8438f164ae51f382f2c1dR69-R77
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe caching the values like in the original PR is the right call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, listing all the Nodes and all the Pods clearly does not scale at all.
We need to reconsider re-activating / improving the cache, intersecting #588. That may provide a smooth transition once we migrate over to controller-runtime.
Otherwise, we can consider a simple caching mechanism, e.g. that would rate limit the call to allocatableCapacity
using something like golang.org/x/time/rate.NewLimiter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made another commit to schedule the expensive call to run once a minute. I wasn't sure how to use golang.org/x/time/rate.NewLimiter
without making it look like an overkill.
Nice one! I just followed the verification steps, it works and shows the CPU, GPU, and memory allocatable capacity 👍. Something to note:
Does this need to be changed over in the codeflare-operator? |
As commented in the mentioned PR, I'm not sure if this would require an ADR as well. cc: @dimakis @anishasthana @astefanutti |
I think we would want to be exposing on port |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChristianZaccaria Thanks for reviewing and trying this up! :)
Nice one! I just followed the verification steps, it works and shows the CPU, GPU, and memory allocatable capacity 👍. Something to note:
* I had to change the last bit of [this line (on the operator)](https://github.com/project-codeflare/codeflare-operator/blob/c3383c8e19a8a8c94169c218f1ed8c9faa8e12d5/main.go#L30) to `v1alpha1` as I was getting the following error after running `make build`:
github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1: module github.com/project-codeflare/multi-cluster-app-dispatcher@latest found (v1.35.0, replaced by ../multi-cluster-app-dispatcher), but does not contain package github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1
Does this need to be changed over in the codeflare-operator?
Good catch! I missed this bit of change I had to make too. I think this is due to #650 and once there is a new relase of MCAD, this change in the operator is mandatory.
I think we would want to be exposing on port
8083
as done for the original PR
IIUC, after the redesign, the exposing port is the business of the operator. Its default value is set here
https://github.com/project-codeflare/codeflare-operator/blob/99d2fc8b4e3bdcada5259f4417a8c2b2ca342430/main.go#L98
A different value can be set by setting metrics.bindAddress
in codeflare-operator-config
config map
That's correct. MCAD is only responsible for the instrumentation and the registration of the metrics. Serving these metrics is the responsibility of the CodeFlare operator. |
@ChristianZaccaria any conclusion on the necessity of an ADR? I started working on it. |
@ChristianZaccaria @astefanutti Is there anything I can do to make progress on this PR? |
Issue link
What changes have been made
This is another take on #573 to add initial metrics considering the redesign of codeflare-operator.
Verification steps
I checked that the metrics are available in the codeflare-operator operator by running it:
and in a new shell:
Checks