-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for Service Level Objectives in witchcraft #36
base: develop
Are you sure you want to change the base?
Conversation
One semi-related thing here is I found it awkward to have to add the health check source in the server builder instead of being able to add it during the initialization function. Is there a better way of doing this today? |
You can add health/readiness/liveness sources either in the builder OR in the init function -- the
This lets you register health, readiness and liveness sources in the init function as opposed to at construction time if you prefer to do so. |
Taking a look at the broader PR now -- will focus on high-level feedback rather than exact code content (can review content once we ensure that we have agreement on top-level). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look and I understand the intent -- definitely on-board with it generally.
I don't have a concrete suggestion for how to do this yet, but having to duplicate the endpoint template (or name) in registering the endpoint with an SLO is a bit of a bummer, as it leads to a lot of the same constants recurring. Here's example registration code when using wresource
(which is pretty common):
res := wresource.New("installEndpoints", router)
return res.Get("installNum", "/installNum", slos.Register("installNum", func(rw http.ResponseWriter, req *http.Request) {
rest.WriteJSONResponse(rw, num, http.StatusOK)
}, time.Millisecond*100, time.Millisecond*200, time.Millisecond*300))
In this code, "installNum" is the endpointName and SLO name, and then we also have /installNum
as the path. This makes the process of registering an endpoint pretty verbose, although I don't have a great suggestion for an alternative.
As a bigger-picture question, I wonder if this is the best approach to this -- this approach requires all products to update their code and provide concrete time-based estimates for endpoints on a per-endpoint basis ahead of time. Since we have request logs for endpoints (which use the endpoint template), it seems like this could be done for products as log analysis rather than as a runtime check. Not sure if that's realistic, but I think it's worth at least thinking through.
Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @bmoylan, @ellisjoe, and @nmiyake)
The problem with the request log based approach is that nobody actively does it. While this has been around forever, and devs have also had endpoint specific duration metrics (which you could create the p95, p99, p999 alerts on), it requires someone to go actively create those monitors. The value in doing this in the health endpoint is that it gets auto-wired into our existing monitoring framework. Thoughts on a path forward? I can move this to an external repo if you think it would be better there until we find a cleaner way of integrating it. It also probably doesn't hurt to have it in witchcraft-go-server and let people opt-in to it. |
OK yeah I'm not opposed to exposing this, and don't have major concerns since this doesn't introduce any new dependencies. The one thing that I still find a bit strange is that this approach requires products to declare their expected runtime performance thresholds as part fo the compiled output. Are we going to provide any guidance to developers about how they should properly choose these values for given endpoints, or do we anticipate this as an after-the-fact thing where we monitor the values in production and then have products re-update with thresholds based on real-world data? I'm aligned on the goals here, but want to make sure we have a story on how we expect developers to actually make use of this in practice. Once we determine that we're in agreement that we should implement this (and I think I'm close -- would just like some clarification/thoughts on the above), then I'll move to review the contents of the PR and we can merge on approval. |
I think there are two ways in which I'd expect a dev to approach using SLOs:
I also suspect there will be a set of endpoints where it doesn't make sense to define an SLO, as the duration is highly coupled to the nature of the request (anything that has a highly variable request or response body). |
Is this going to be limited to endpoints? In Java you can add it to arbitrary methods, which is very useful for being able to monitor and prevent regressions on some functions. It will definitely not encapsulate dependencies affecting your processing very well (which a problem generic to SLOs). |
The goal of this PR is to introduce an easy way of adding service level objectives (SLOs) to witchcraft, and have the SLO status reflected via a health check. I'm looking for feedback on the implementation
of the SLO package and thoughts on the right way of integrating them with the route registration process.
This change is