Suggestion: Include Predefined Checks for Supervision Tree Readiness #5

eric0fw · 2024-07-26T20:31:00Z

Hi,

First, I want to thank you for developing Existence. I have been exploring its features and it seems like an excellent tool for managing health checks in Elixir applications.

I have a suggestion for a predefined check that I believe could enhance the library: a built-in check to ensure that the entire supervision tree has fully started, utilizing a "terminator" process or similar mechanism. This would be particularly useful for readiness probes in Kubernetes deployments.

Proposed Feature:

Supervision Tree Readiness Check: A predefined health check that verifies whether the entire supervision tree is fully started. This could be implemented by adding a special "terminator" process as the last child in the supervision tree. The readiness check would then confirm that this process is running, indicating that all prior processes have been started successfully.

Benefits:

Reliability: Ensures that all necessary processes are up and running before marking the application as ready.
Ease of Use: Provides a ready-to-use solution for a common requirement, reducing the need for custom implementations.
Kubernetes Integration: Enhances the ability to use Existence with Kubernetes readiness probes, providing a more robust setup for production deployments.
Simplified Readiness Check: This check wouldn't be used on its own but could be useful in determining readiness without the need to implement numerous individual checks.

Considerations:

Flaws/Challenges: I am curious if there are any potential flaws or challenges you foresee with this approach. For example, are there edge cases or scenarios where this might not be reliable or could cause issues?

I appreciate your time and consideration of this suggestion. Looking forward to hearing your thoughts!

Best regards,

up2jj · 2024-08-02T11:27:26Z

Hey @eric0fw,

Thanks for your kind words :) We are using the library in production and it is not causing any problems.

The additional supervision tree check obviously makes sense for environments running in a k8 cluster. Depending on the complexity of the application, checking just the last process (child) may not be sufficient to consider the application ready:

previous processes may have been started, but their operation may cause repeated restarts. In such a situation, Supervisor.which_children/1 returns a status of :restarting instead of pid for such a process.
previous processes may have been started, but when the callback GenServer.init/1 returns :ignore, initialisation of the supervision tree continues but the process is ignored:

Returning :ignore will cause start_link/3 to return :ignore and the process will exit normally without entering the loop or calling terminate/2. If used when part of a supervision tree the parent supervisor will not fail to start nor immediately try to restart the GenServer. The remainder of the supervision tree will be started and so the GenServer should not be required by other processes.

source: https://hexdocs.pm/elixir/1.17.2/GenServer.html#c:init/1

In both of these situations I would consider the supervision tree not ready, especially if all processes are required to run the app.
Nevertheless, the problem is interesting and we could look into it.

@andrzej-mag I would be happy to hear your thoughts on this topic.

andrzej-mag · 2024-10-07T08:37:47Z

I am sorry for late reply. I didn't receive any notification and I wasn't aware of this discussion.

Application supervision trees are created from stable and trusted processes usually. If some supervised OTP process would crash infrequently, process would be restarted so quickly that it would be way below any orchestration platform time resolution threshold. If supervised process would crash continuously, application supervisor should terminate itself and application - no need for any additional health-checks. I don't see any practical value in health-checking application supervision tree or single tree processes. Situation is different with health-checking unsupervised external dependencies which can fail due to uncontrolled reasons (eg. network issue) for extended time.

To health-check some important OTP process you can use for example is_pid(Process.whereis(:some_important_process)) in the check callback implementation. If checked OTP process is supervised, supervisor restart intensity and period in reference to orchestration timeouts should be considered probably.

Existence should be started as a last process in the supervision tree as all health-checked dependencies should be started first anyway. Library will act then as a "terminator" process mentioned in the @eric0fw comment.
To use Existence as a readiness check with dependencies requiring substantial time to initialize you can set initial_delay in the check configuration.

up2jj self-assigned this Aug 2, 2024

up2jj added the enhancement New feature or request label Aug 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestion: Include Predefined Checks for Supervision Tree Readiness #5

Suggestion: Include Predefined Checks for Supervision Tree Readiness #5

eric0fw commented Jul 26, 2024

up2jj commented Aug 2, 2024

andrzej-mag commented Oct 7, 2024

Suggestion: Include Predefined Checks for Supervision Tree Readiness #5

Suggestion: Include Predefined Checks for Supervision Tree Readiness #5

Comments

eric0fw commented Jul 26, 2024

up2jj commented Aug 2, 2024

andrzej-mag commented Oct 7, 2024