Feature: add http health probe #21

Open
dhkron opened this issue Dec 31, 2020 · 4 comments

@dhkron

dhkron commented Dec 31, 2020

Hey,
I suggest adding an HTTP server to the supervisor so that the health of the worker pool can be monitored.
An HTTP request would return the status of the currently running workers, and perhaps supervisor metadata.
Query string parameters could define what counts as a “healthy” supervisor, e.g. no more than N workers silent for more than T seconds.
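
A minimal sketch of such an endpoint, using only the Python standard library. The query parameter names `n` and `t`, their defaults, the `LAST_SEEN` registry, the port, and the 200/500 status codes are all assumptions made for illustration; none of them come from the supervisor's actual code.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical registry: worker name -> time.monotonic() timestamp of its
# last observed activity. How the supervisor fills it is left open here.
LAST_SEEN = {}


def is_healthy(last_seen, max_silent, silence_seconds, now):
    """True if no more than `max_silent` workers were silent for more than `silence_seconds`."""
    silent = sum(1 for ts in last_seen.values() if now - ts > silence_seconds)
    return silent <= max_silent


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        try:
            # Assumed parameter names: n = tolerated silent workers, t = seconds.
            max_silent = int(query.get("n", ["0"])[0])
            silence_seconds = float(query.get("t", ["60"])[0])
        except ValueError:
            self.send_error(400, "n and t must be numeric")
            return
        healthy = is_healthy(LAST_SEEN, max_silent, silence_seconds, time.monotonic())
        body = b"healthy\n" if healthy else b"unhealthy\n"
        self.send_response(200 if healthy else 500)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever serving health requests on localhost (port is an assumption)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

With `serve()` running, a request such as `GET /?n=1&t=30` would report healthy as long as at most one worker has been silent for more than 30 seconds.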

@wavenator
Contributor

We need to think about when a Supervisor is considered healthy. At a minimum, a responding Supervisor is healthy, but that implies nothing about the status of its sub-workers. How would you tell whether a sub-worker is healthy? It does not communicate with the Supervisor until it has finished its batch of tasks.

As for the implementation, I know k8s supports HTTP health checks, but I'm not sure that is the right way to go here. Implementing an HTTP server just for health checks sounds like overkill to me. I'd explore an asyncio TCP server first: an HTTP server brings more overhead than I'm willing to pay here, and both achieve the same effect. We can also consider a file-based probe.
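
For comparison, a rough sketch of what an asyncio TCP liveness probe could look like. The function names, the `OK` reply, and the host/port defaults are assumptions for illustration. The probe only shows that the process's event loop is responsive and says nothing about sub-workers.

```python
import asyncio


async def handle_probe(reader, writer):
    # Being able to answer at all shows the event loop is alive.
    writer.write(b"OK\n")
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def start_probe_server(host="127.0.0.1", port=0):
    """Start the probe listener; port 0 lets the OS pick a free port."""
    return await asyncio.start_server(handle_probe, host, port)


async def probe(host, port, timeout=1.0):
    """Client side: True if the server answers OK within `timeout` seconds."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
        line = await asyncio.wait_for(reader.readline(), timeout)
        writer.close()
        await writer.wait_closed()
        return line == b"OK\n"
    except (OSError, asyncio.TimeoutError):
        return False
```

A k8s `tcpSocket` probe would only need the listener to accept connections; the `OK` reply is there for clients that want an explicit answer.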

Anyway, we must define "Healthy" before we proceed with any of these.

@dhkron
Author

dhkron commented Jan 3, 2021

For the "healthy" definition, I'd consider "X out of Y workers have polled messages in the last T seconds".
T differs per worker, since some workers take a few seconds and others take a few minutes.
The worker-count threshold X can also be customized.
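
The definition above can be sketched as a small pure function. `WorkerState`, `count_active`, and `pool_is_healthy` are hypothetical names, and the sketch assumes each worker carries its own T alongside its last poll timestamp.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    last_poll: float    # timestamp of the worker's most recent queue poll
    max_silence: float  # this worker's own T, in seconds


def count_active(workers, now):
    """Number of workers that polled within their own T seconds."""
    return sum(1 for w in workers.values() if now - w.last_poll <= w.max_silence)


def pool_is_healthy(workers, min_active, now):
    """True if at least `min_active` (X) of the workers (Y) polled recently enough."""
    return count_active(workers, now) >= min_active
```

For example, with a fast worker (T = 10 s) that polled 5 s ago, a slow worker (T = 300 s) that polled 100 s ago, and a stuck worker (T = 10 s) that polled 100 s ago, two of the three count as active.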

As for TCP or file-based probes: the upside of an HTTP health probe is that it can take parameters in a simple manner, as simple as k8s's httpGet health probe YAML block.
That way, the HTTP server does not need to decide for itself whether its workers are healthy: it receives X and T as parameters and returns a health result based on those criteria.
This is impossible with a file-based health probe; with TCP it is possible, but trickier.
However, since the Supervisor serves as a middle layer, X and T can be part of its own config, and then TCP and file-based probes can be used as well.
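
If X and T live in the Supervisor's own config, a file-based probe could look roughly like this: the Supervisor touches a heartbeat file only while it considers the pool healthy, and the external check looks at the file's age. The function names and the heartbeat path are assumptions for illustration.

```python
import os
import time
from pathlib import Path


def write_heartbeat(path, healthy):
    """Supervisor side: refresh the file's mtime only while the pool is healthy."""
    if healthy:
        Path(path).touch()


def heartbeat_is_fresh(path, max_age_seconds, now=None):
    """Probe side: True if the heartbeat file exists and was touched recently."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return False
    now = time.time() if now is None else now
    return now - mtime <= max_age_seconds
```

The probe itself takes no health parameters; the only knob on its side is how stale the file may get before the check fails.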

@wavenator
Contributor

I think a health check response should return a boolean result: whether the service is healthy or not. The parameters defining healthiness should be part of the worker's configuration. I'm not sure about the feasibility of such a solution, though. The interaction between workers and their supervisors is much more primitive than you would imagine: their only interaction happens when the worker reaches its max_tasks_per_run.

We should also think about edge cases such as a starving worker, which is not consuming tasks because there are none to consume rather than because of a health problem. I think implementing a health check that merely indicates the supervisor is alive and responding is enough. Indicating the workers' healthiness should be discussed thoroughly to develop a lightweight and precise solution that is not prone to false positives.

@dhkron
Author

dhkron commented Jan 3, 2021

I think a good-enough approach to worker healthiness is whether or not they poll the queue.
Making the thresholds part of the request allows a single, simple server to respond accordingly, and the workers remain simple: they only log their last poll time.
In addition, a boolean result is great (it could be status code 200 or 500), but you could also add statistics to the response if they are available.
All this approach needs is for the workers to log their last poll time in memory shared with the supervisor, which is a relatively small change.
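
One way the shared last-poll table could be sketched is with `multiprocessing.Array`, with one slot per worker. The helper names and the slot-per-worker layout are assumptions, not the library's existing design. Wall-clock `time.time()` is used because the timestamps are compared across processes.

```python
import multiprocessing
import time


def make_poll_table(worker_count):
    """Shared array of C doubles, one slot per worker; 0.0 means 'never polled'."""
    return multiprocessing.Array("d", worker_count)


def record_poll(table, slot, now=None):
    """Worker side: call on every queue poll, even if the queue was empty."""
    table[slot] = time.time() if now is None else now


def silent_workers(table, max_silence, now=None):
    """Supervisor side: slots whose last poll is older than `max_silence` seconds."""
    now = time.time() if now is None else now
    return [slot for slot, ts in enumerate(table[:]) if now - ts > max_silence]
```

The table would be created by the supervisor and handed to each worker process at start-up. Because polls are recorded rather than completed tasks, a starving worker that keeps polling an empty queue still counts as alive.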
