Feature: add http health probe #21

dhkron · 2020-12-31T23:15:41Z

Hey,
I suggest adding an HTTP server to the supervisor so that the health of the worker pool can be monitored.
An HTTP request would return the status of the currently running workers, and perhaps supervisor metadata.
Query string params could define what counts as a "healthy" supervisor, e.g. no more than N workers silent for more than T seconds.

wavenator · 2021-01-03T09:33:36Z

We need to think about when a Supervisor is considered healthy. At the minimal level, a responding Supervisor is healthy, yet that implies nothing about the status of its sub-workers. How would you tell if a sub-worker is healthy? It has no communication with the Supervisor until it has finished its pile of tasks.

About the implementation suggestion, I know k8s supports HTTP health-checks but I'm not sure this is a good way to go here. Implementing an HTTP server just for health-checks sounds like overkill to me. I'd explore a TCP asyncio server first. An HTTP server brings more overhead than I'm willing to pay here, and both achieve the same effect. We can also consider using a file-based probe.
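
A minimal sketch of the TCP option, assuming the supervisor already runs an asyncio event loop. The names `handle_probe`, `start_probe_server`, and `check_probe` and the `OK` reply are hypothetical and not part of the project. The probe only shows that the supervisor's loop is alive and accepting connections; it says nothing about the workers.

```python
import asyncio


async def handle_probe(reader: asyncio.StreamReader,
                       writer: asyncio.StreamWriter) -> None:
    # Reaching this callback means the event loop is still responsive.
    writer.write(b"OK\n")
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def start_probe_server(host: str = "127.0.0.1", port: int = 0):
    # port=0 asks the OS for a free port; a real deployment would pin one.
    return await asyncio.start_server(handle_probe, host, port)


async def check_probe(host: str, port: int) -> bytes:
    # Client side: what a probing process would do.
    reader, writer = await asyncio.open_connection(host, port)
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def demo() -> bytes:
    server = await start_probe_server()
    port = server.sockets[0].getsockname()[1]
    try:
        return await check_probe("127.0.0.1", port)
    finally:
        server.close()
        await server.wait_closed()
```

For k8s, a `tcpSocket` probe only needs the connection to succeed, so the reply line is optional.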

Anyway, we must define "Healthy" before we proceed with either of these.

dhkron · 2021-01-03T09:41:06Z

For the "healthy" definition, I'd consider "X out of Y workers have polled messages in the last T seconds".
T differs per worker, since some workers take a few seconds and some take a few minutes.
The number of silent workers X can also be customized.
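
As a rough sketch, the definition above could be written as a pure function. The name `is_pool_healthy` and its arguments are hypothetical, and X is read here as the maximum number of silent workers tolerated, which is one possible interpretation.

```python
from typing import Mapping


def is_pool_healthy(last_poll: Mapping[str, float],
                    timeouts: Mapping[str, float],
                    max_silent: int,
                    now: float) -> bool:
    """Return True if at most `max_silent` workers are silent.

    last_poll:  worker name -> timestamp of its last queue poll
    timeouts:   worker name -> its own T, in seconds
    max_silent: X, the number of silent workers that is still acceptable
    """
    silent = sum(
        1
        for worker, polled_at in last_poll.items()
        if now - polled_at > timeouts[worker]
    )
    return silent <= max_silent
```

For example, with `now=100.0`, a worker that last polled at `80.0` with `T=10` counts as silent, while one that last polled at `10.0` with `T=300` does not.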

As for TCP/file-based: the upside of an HTTP health probe is that it can take parameters in a simple manner, as simple as k8s's httpGet health probe yaml block.
That way, the HTTP server does not need to decide whether its workers are healthy or not: it gets X & T as parameters and returns the healthiness result based on those criteria.
Doing this with a file-based health probe is impossible; with TCP it is possible, but trickier.
However, since the Supervisor serves as a middle layer, X & T can be part of its own config, and then TCP & file can be used.
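
A sketch of the query-parameter variant using only the standard library. The parameter names `x` and `t`, their defaults, the `LAST_POLL` registry, and the 200/500/400 status codes are all assumptions made for illustration.

```python
import time
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse

# Hypothetical registry: worker name -> time.monotonic() value of last poll.
LAST_POLL: dict = {}


def probe_status(url: str, last_poll: dict, now: float) -> int:
    """Map a probe URL such as /health?x=1&t=30 to an HTTP status code."""
    query = parse_qs(urlparse(url).query)
    try:
        max_silent = int(query.get("x", ["0"])[0])   # X: tolerated silent workers
        timeout = float(query.get("t", ["60"])[0])   # T: seconds without a poll
    except ValueError:
        return 400  # malformed parameters
    silent = sum(1 for polled_at in last_poll.values() if now - polled_at > timeout)
    return 200 if silent <= max_silent else 500


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # The handler holds no thresholds of its own; they come from the URL.
        self.send_response(probe_status(self.path, LAST_POLL, time.monotonic()))
        self.end_headers()
```

A k8s `httpGet` probe would then carry the thresholds in its `path`, for example `/health?x=1&t=30`.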

wavenator · 2021-01-03T09:53:44Z

I think a health check response should return a boolean result: whether the service is healthy or not. The parameters defining healthiness should be part of the worker's configuration. I'm not sure about the availability of such a solution though. The interaction between workers and their supervisors is much more primitive than you would imagine. Their only interaction happens when the worker reaches its max_tasks_per_run. We should also think about a starving worker, which is not consuming tasks because there are none to consume rather than because of a healthiness problem. We should think about the edge-cases here. I think implementing a health check that merely indicates the supervisor is alive and responding is enough. Indicating the workers' healthiness should be discussed thoroughly to develop a lightweight and precise solution that is not prone to false positives.

dhkron · 2021-01-03T14:16:06Z

I think a good-enough approach for the workers' healthiness is whether or not they look in the queue.
Making the thresholds part of the request will allow a single & simple server to respond accordingly, and the workers will remain simple, only logging their last poll time.
In addition, a boolean result is great (it could be status code 200 or 500), but you could also add statistics to the response, if they are available.
All this approach needs is for the workers to log their last poll time in memory shared with the supervisor, a relatively small change.
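
A sketch of that shared-memory idea, assuming workers are separate processes started by the supervisor. `make_poll_table`, `worker_loop`, and `silent_workers` are hypothetical names; `multiprocessing.Array` is the standard-library shared array.

```python
import multiprocessing as mp
import time


def make_poll_table(n_workers: int):
    # One shared double per worker, initialised to 0.0 ("never polled").
    return mp.Array("d", n_workers)


def worker_loop(slot: int, last_poll, polls: int) -> None:
    # Worker side: record the poll time before looking in the queue, so a
    # starving worker with an empty queue still counts as alive.
    for _ in range(polls):
        last_poll[slot] = time.time()
        # ... poll the queue and handle any task here ...


def silent_workers(last_poll, timeout: float, now: float) -> list:
    # Supervisor side: slots whose last poll is older than `timeout` seconds.
    return [i for i, polled_at in enumerate(last_poll[:]) if now - polled_at > timeout]


# The supervisor would create the table once and hand it to each worker, e.g.
#   table = make_poll_table(4)
#   mp.Process(target=worker_loop, args=(0, table, 100)).start()
```

Because the timestamp is written on every poll, including polls that find nothing, an idle queue is not reported as unhealthy.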
