
Orphan workers after restart with a different n value #1091

Open
dan-jensen opened this issue Jun 2, 2019 · 2 comments
PROBLEM
Delayed Job essentially orphans workers when you change the worker count from 1 to 2+, or vice versa, and restart. By that I mean Delayed Job allows certain workers to continue running through stop, start and restart commands. The exact reasons are explained below, but the fundamental problem is that the delayed_job executable sometimes behaves like a service and other times like a worker management tool.

SIGNIFICANCE
This problem is probably of moderate significance. Worker counts are not changed frequently. However, restart is the most common way of administering workers in production environments. More importantly, orphan workers can eat up memory, slow down servers, and even cause out-of-memory failures. So while it might not happen frequently, it can happen easily in production environments and be serious when it does.

REPLICATION INSTRUCTIONS FOR MULTIPLE ORPHANS (DECREASE WORKER COUNT)

  • Execute script/delayed_job -n 5 restart
  • Observe there are 5 Delayed Job processes running, as expected
  • Execute script/delayed_job -n 1 start
  • Observe there are 6 Delayed Job processes running, when expecting 1 (5 orphan workers)

REPLICATION INSTRUCTIONS FOR ONE ORPHAN (INCREASE WORKER COUNT)

  • Execute script/delayed_job -n 1 start
  • Observe there is 1 Delayed Job process running, as expected
  • Execute script/delayed_job -n 5 restart
  • Observe there are 6 Delayed Job processes running, when expecting 5 (1 orphan worker; see the check below)
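
For either replication above, a quick way to see the mismatch (assuming the default tmp/pids PID directory and that the worker processes keep "delayed_job" in their process titles) is:

    # count running Delayed Job worker processes
    ps aux | grep '[d]elayed_job' | wc -l

    # list the PID files the executable will act on for a given -n value
    ls tmp/pids/

In both scenarios the process count ends up higher than the last command's -n value, and the PID files of the leftover workers are never targeted by subsequent stop/restart invocations.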

ANALYSIS
The two examples above actually have different causes:

  1. Increasing the worker count causes an orphan worker because of the different PID file naming conventions used for a single worker versus multiple workers (unchanged since its introduction, and illustrated after this list). Simply eliminating the single-worker convention would orphan every worker that currently exists under it, making the problem worse rather than better.
  2. Decreasing the worker count causes orphan worker(s) because Delayed Job only issues stop commands (via the Daemons gem) for the first n workers specified. Worker n+1 and beyond continue running through future commands and are therefore orphaned.
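
For reference, this is the naming difference behind cause 1 (assuming the default tmp/pids PID directory):

    # script/delayed_job -n 1 start  =>  single-worker convention
    tmp/pids/delayed_job.pid

    # script/delayed_job -n 5 start  =>  numbered convention
    tmp/pids/delayed_job.0.pid
    tmp/pids/delayed_job.1.pid
    ...
    tmp/pids/delayed_job.4.pid

A stop or restart issued with a different -n only looks for the PID files matching its own naming scheme, so workers registered under the other scheme are never told to stop.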

PROPOSED SOLUTION
A solution to both causes: improve worker termination to avoid orphans. Specifically, restart should stop ALL workers (without regard for the n argument value) before starting n workers.
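
Until restart behaves that way, a blunt interim workaround is to stop workers under both naming schemes (and anything that slips through) before starting the new count. A rough sketch, assuming the previous count was 5 and the new count is 1; adjust the numbers to your setup and be careful that the pkill pattern doesn't match unrelated processes:

    # stop workers registered under the numbered and single-worker conventions
    script/delayed_job -n 5 stop
    script/delayed_job stop

    # last resort: terminate anything that slipped through
    pkill -f delayed_job

    # start the desired number of workers
    script/delayed_job -n 1 start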

From a broader perspective, the proposal is that the delayed_job executable should NOT behave like a worker management tool for starting and stopping arbitrary numbers of workers. It SHOULD behave like a service that starts and stops all workers. That is the most intuitive model, and it is already how the executable works in the most basic context: start. Without any Delayed Job workers running, execute bin/delayed_job start twice. On the second invocation you get: "ERROR: there is already one or more instance(s) of the program running". (That error actually comes from the Daemons gem, but that's beside the point.) This is the expected behavior of a service: if it is already running, you can't start it again. If the delayed_job executable were a worker management tool, the second invocation would simply start one more worker. A worker management tool would also need a command to report how many workers are currently running, so you know how many to stop or restart, and there is none. The executable is clearly intended to behave like a service, but it doesn't in the restart context (and others; see "FOLLOW-ON IMPROVEMENTS" below).

Please see PR #1090 for a proposed solution to the described issue with restart, which will also lay the foundation for follow-on improvements.

FOLLOW-ON IMPROVEMENTS
In addition to resolving the behavior of restart, these changes would also be required if the delayed_job executable is going to behave like a service:

@dan-jensen (Author)

In another issue the culprit was orphaned DelayedJob processes that needed to be manually killed from the command line. The proposal in this issue would solve that problem more permanently.

@cat5inthecradle

Seems related to #1172
