Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose metrics endpoint #486

Closed
wants to merge 19 commits into from
Closed

Expose metrics endpoint #486

wants to merge 19 commits into from

Conversation

dimakis
Copy link
Contributor

@dimakis dimakis commented Jul 18, 2023

Using port 9090 , the conventional prom port.

this PR in odh-deployer will need to be updated also, I will get to it asap

Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@z103cb
Copy link
Contributor

z103cb commented Jul 18, 2023

@dimakis,
You might want to take a look at the error message reported by the e2e test:

kubectl kuttl test --config /home/travis/gopath/src/github.com/project-codeflare/multi-cluster-app-dispatcher/test/kuttl-test.yaml
=== RUN   kuttl
    harness.go:462: starting setup
    harness.go:252: running tests using configured kubeconfig.
    harness.go:275: Successful connection to cluster at: https://127.0.0.1:46073
    logger.go:42: 03:12:58 |  | running command: [sh -c helm upgrade  --install mcad-controller deployment/mcad-controller --namespace kube-system --wait --set loglevel=${LOG_LEVEL} --set resources.requests.cpu=1000m --set resources.requests.memory=1024Mi --set resources.limits.cpu=4000m --set resources.limits.memory=4096Mi --set image.repository=$IMAGE_REPOSITORY_MCAD --set image.tag=$IMAGE_TAG_MCAD --set image.pullPolicy=$MCAD_IMAGE_PULL_POLICY --set configMap.quotaEnabled='"true"' --set quotaManagement.rbac.apiGroup=ibm.com --set quotaManagement.rbac.resource=quotasubtrees  --set configMap.name=mcad-controller-configmap --set configMap.preemptionEnabled='"true"']
    logger.go:42: 03:12:58 |  | WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/travis/gopath/src/github.com/project-codeflare/multi-cluster-app-dispatcher/kubeconfig
    logger.go:42: 03:12:58 |  | WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/travis/gopath/src/github.com/project-codeflare/multi-cluster-app-dispatcher/kubeconfig
    logger.go:42: 03:12:59 |  | Release "mcad-controller" does not exist. Installing it now.
    logger.go:42: 03:13:02 |  | Error: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(Deployment.spec.template.spec.containers[0]): unknown field "metrics" in io.k8s.api.core.v1.Container
    logger.go:42: 03:13:02 |  | command failure, skipping 1 additional commands
    harness.go:513: cleaning up
    harness.go:570: removing temp folder: ""
    harness.go:595: fatal error running commands: exit status 1

@asm582
Copy link
Member

asm582 commented Jul 21, 2023

@dimakis We need to make sure that a build passes, can you please check why are the builds failing?

@z103cb
Copy link
Contributor

z103cb commented Aug 3, 2023

/lgtm

I will approve this PR once @astefanutti has a chance to chime in.

cmd/kar-controllers/app/metrics.go Outdated Show resolved Hide resolved
Makefile Outdated
echo "Host architecture: $(shell uname -m)"
@HOST_ARCH=$(shell uname -m); \
if [ "$$HOST_ARCH" = "arm64" ]; then \
if [ "$(strip $(GO_BUILD_ARGS))" = "" ]; then \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this get moved to a separate PR? It's not strictly related to this PR, and we would be able to iterate on that part without blocking the work on the metrics part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right here, to alter this is the only way I could test my changes, hence in this PR but I can move it now 👍

cmd/kar-controllers/app/options/options.go Outdated Show resolved Hide resolved
RecordMetrics()

metricsHandler := http.NewServeMux()
metricsHandler.Handle("/metrics", promhttp.Handler())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add the default Prometheus error handler (ErrorHandling: promhttp.HTTPErrorOnError)

cmd/kar-controllers/app/server.go Outdated Show resolved Hide resolved
@openshift-ci
Copy link

openshift-ci bot commented Aug 9, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the lgtm label Aug 29, 2023
@openshift-ci
Copy link

openshift-ci bot commented Aug 29, 2023

New changes are detected. LGTM label has been removed.

@astefanutti
Copy link
Contributor

/retest

@astefanutti
Copy link
Contributor

@dimakis could you squash the refactor / review commits?

@asm582
Copy link
Member

asm582 commented Aug 29, 2023

@dimakis The build has been consistently failing on the quota management step 3 test, my guess is in server.go or during MCAD startup we are failing to init quota trees for this PR.

@dimakis
Copy link
Contributor Author

dimakis commented Sep 4, 2023

@dimakis The build has been consistently failing on the quota management step 3 test, my guess is in server.go or during MCAD startup we are failing to init quota trees for this PR.

I'm not sure I know how to progress this, I'm not familiar with the quota aspect of MCAD. can I have some help here @asm582 ?

…rently

this commit also edits the startHealthServer to also start collecting of default prometheus metrics. channels are used to collect potential errors and allow for graceful shutdown of servers
wg.Add(1)
g.Go(func() error {
defer wg.Done()
jobctrl.Run(gCtx.Done())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimakis,
The jobctrl.Run function is asysnchrouous, it will fire off a few deamon goroutines and return, thusly triggering the defer call and efectively shutting down the process. (the container is constant crashbackoff loop). I would suggest making the jobctrl.Run be synchrounous or revisiting this implementation.

cc: @astefanutti

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I see when running the controller:

Screenshot 2023-09-19 at 09 52 46

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@z103cb thanks, very good catch. That Go routine should block for the group context channel to close.

@astefanutti
Copy link
Contributor

@dimakis @z103cb this PR has become obsolete in the context of project-codeflare/codeflare-operator#279 and should be closed. The metrics registration should moved to the CFO once project-codeflare/codeflare-operator#216 is merged, and be done using the controller-runtime registry.

@dimakis dimakis closed this Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants