Feature: add http health probe #21
Comments
We need to think about when a Supervisor is considered healthy. At the most basic level, a Supervisor that responds is healthy, but that says nothing about the status of its sub-workers. How would you tell whether a sub-worker is healthy? It does not communicate with the Supervisor until it has finished its pile of tasks. As for the implementation suggestion, I know k8s supports HTTP health checks, but I'm not sure that is the right way to go here. Implementing an HTTP server just for health checks sounds like overkill to me. I'd explore an asyncio TCP server first: an HTTP server brings more overhead than I'm willing to pay here, and both achieve the same effect. We could also consider a file-based probe. Either way, we must define "healthy" before we proceed with any of these.
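To make the TCP option concrete, here is a minimal sketch of an asyncio liveness probe. It only shows that the process accepting connections is alive and that its event loop is responsive; it says nothing about sub-workers. The names `handle_probe`, `start_probe_server`, `check_probe` and `demo`, and the `OK` payload, are assumptions made for illustration and are not part of the project.

```python
import asyncio


async def handle_probe(reader, writer):
    # Accepting the connection already shows the event loop is responsive.
    # The short payload is only a convenience for manual checks (hypothetical).
    writer.write(b"OK\n")
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def start_probe_server(host="127.0.0.1", port=0):
    # port=0 lets the OS pick a free port; a real deployment would pin one.
    return await asyncio.start_server(handle_probe, host, port)


async def check_probe(host, port):
    # Client side of the probe: connect, read until the server closes.
    reader, writer = await asyncio.open_connection(host, port)
    data = await reader.read()
    writer.close()
    await writer.wait_closed()
    return data


async def demo():
    server = await start_probe_server()
    port = server.sockets[0].getsockname()[1]
    try:
        return await check_probe("127.0.0.1", port)
    finally:
        server.close()
```

A k8s `tcpSocket` probe only needs the connection to succeed, so the payload could be dropped entirely.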
For the "healthy" definition, I'd consider "X out of Y workers have polled messages in the last T seconds". As for TCP- or file-based probes: the upside of an HTTP health probe is that it can take parameters in a simple manner, as simple as k8s's
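The "X out of Y in the last T seconds" rule could be sketched as follows. This assumes the supervisor somehow has a last-poll timestamp for every worker, which is not how workers report today, so `last_poll_times`, `count_active` and `is_healthy` are all hypothetical names.

```python
import time


def count_active(last_poll_times, window_seconds, now=None):
    # last_poll_times: hypothetical mapping of worker id -> monotonic timestamp
    # of that worker's most recent poll.
    now = time.monotonic() if now is None else now
    return sum(1 for t in last_poll_times.values() if now - t <= window_seconds)


def is_healthy(last_poll_times, min_active, window_seconds, now=None):
    # Healthy when at least `min_active` (X) of the tracked workers (Y)
    # polled within the last `window_seconds` (T).
    return count_active(last_poll_times, window_seconds, now) >= min_active
```

Passing `now` explicitly keeps the rule easy to test; in normal use it would default to `time.monotonic()`.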
I think a health check should return a boolean result: the service is either healthy or it is not. The parameters that define healthiness should be part of the worker's configuration. I'm not sure such a solution is feasible, though. The interaction between workers and their supervisor is much more primitive than you might imagine: the only interaction happens when a worker reaches its max_tasks_per_run. We should also think about a starving worker, one that is not consuming tasks because there are none to consume rather than because of a health problem. We need to consider the edge cases here. I think a health check that merely indicates the supervisor is alive and responding is enough for now. Reporting the workers' health should be discussed thoroughly so we can develop a lightweight, precise solution that is not prone to false positives.
I think a good-enough approach to worker healthiness is whether or not they look in the queue.
Hey,
I suggest adding an HTTP server to the supervisor so that the health status of the worker pool can be monitored.
An HTTP request would return the status of the currently running workers, and perhaps supervisor metadata.
It's possible to use query string parameters to define what counts as a "healthy" supervisor, e.g. no more than N workers are silent for more than T seconds.
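As a rough sketch of the query string idea, the snippet below parses N and T from a request path and maps the result to an HTTP status code. The parameter names `max_silent` and `silence_seconds`, the defaults, the `/health` path and the per-worker silence ages are all assumptions for illustration; nothing here reflects an existing API of the project.

```python
from urllib.parse import parse_qs, urlsplit


def parse_thresholds(path, default_max_silent=0, default_silence_seconds=60.0):
    # Hypothetical parameters: max_silent is N, silence_seconds is T.
    query = parse_qs(urlsplit(path).query)
    max_silent = int(query.get("max_silent", [default_max_silent])[0])
    silence_seconds = float(
        query.get("silence_seconds", [default_silence_seconds])[0]
    )
    return max_silent, silence_seconds


def probe_status(path, seconds_since_last_activity):
    # seconds_since_last_activity: hypothetical mapping of worker id -> seconds
    # since the supervisor last heard from that worker.
    max_silent, silence_seconds = parse_thresholds(path)
    silent = sum(
        1 for age in seconds_since_last_activity.values() if age > silence_seconds
    )
    # 200 when no more than N workers have been silent longer than T, else 503.
    return 200 if silent <= max_silent else 503
```

A k8s `httpGet` probe treats any status from 200 to 399 as success, so returning 503 when the threshold is exceeded would mark the pod as failing.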