-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose metrics endpoint #486
Conversation
5cf0992
to
4d6bf26
Compare
@dimakis,
|
@dimakis We need to make sure that a build passes, can you please check why are the builds failing? |
4bcf9e1
to
c09181f
Compare
c09181f
to
0db79d4
Compare
/lgtm I will approve this PR once @astefanutti has a chance to chime in. |
Makefile
Outdated
echo "Host architecture: $(shell uname -m)" | ||
@HOST_ARCH=$(shell uname -m); \ | ||
if [ "$$HOST_ARCH" = "arm64" ]; then \ | ||
if [ "$(strip $(GO_BUILD_ARGS))" = "" ]; then \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this get moved to a separate PR? It's not strictly related to this PR, and we would be able to iterate on that part without blocking the work on the metrics part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right here, to alter this is the only way I could test my changes, hence in this PR but I can move it now 👍
cmd/kar-controllers/app/server.go
Outdated
RecordMetrics() | ||
|
||
metricsHandler := http.NewServeMux() | ||
metricsHandler.Handle("/metrics", promhttp.Handler()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add the default Prometheus error handler (ErrorHandling: promhttp.HTTPErrorOnError
)
0db79d4
to
cb5f178
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
e87fbfa
to
96ffc09
Compare
New changes are detected. LGTM label has been removed. |
/retest |
@dimakis could you squash the refactor / review commits? |
@dimakis The build has been consistently failing on the quota management step 3 test, my guess is in server.go or during MCAD startup we are failing to init quota trees for this PR. |
I'm not sure I know how to progress this, I'm not familiar with the quota aspect of MCAD. can I have some help here @asm582 ? |
…rently this commit also edits the startHealthServer to also start collecting of default prometheus metrics. channels are used to collect potential errors and allow for graceful shutdown of servers
…shutdown once jobctrl is shut down
…rver before controller
…er minor edits from reviews
4c6b5fb
to
6a0e175
Compare
wg.Add(1) | ||
g.Go(func() error { | ||
defer wg.Done() | ||
jobctrl.Run(gCtx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimakis,
The jobctrl.Run
function is asysnchrouous, it will fire off a few deamon
goroutines and return, thusly triggering the defer call and efectively shutting down the process. (the container is constant crashbackoff loop). I would suggest making the jobctrl.Run be synchrounous or revisiting this implementation.
cc: @astefanutti
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@z103cb thanks, very good catch. That Go routine should block for the group context channel to close.
@dimakis @z103cb this PR has become obsolete in the context of project-codeflare/codeflare-operator#279 and should be closed. The metrics registration should moved to the CFO once project-codeflare/codeflare-operator#216 is merged, and be done using the controller-runtime registry. |
Using port 9090 , the conventional prom port.
this PR in odh-deployer will need to be updated also, I will get to it asap