PROBLEM
Delayed Job essentially orphans workers when you change the worker count from 1 to 2+, or vice versa, and restart. By that I mean Delayed Job allows certain workers to continue running through stop, start and restart commands. The exact reasons are explained below, but the fundamental problem is that the delayed_job executable sometimes behaves like a service and other times like a worker management tool.
SIGNIFICANCE
This problem probably has moderate significance. Worker counts are not changed frequently. However, restart is the most common way of administering workers in production environments. More importantly, orphan workers can consume memory, slow down servers and even cause out-of-memory failures. So while it might not happen frequently, it can happen easily in production environments and be serious when it does.
REPLICATION INSTRUCTIONS FOR MULTIPLE ORPHANS (DECREASE WORKER COUNT)
Execute script/delayed_job -n 5 restart
Observe there are 5 Delayed Job processes running, as expected
Execute script/delayed_job -n 1 start
Observe there are 6 Delayed Job processes running, when expecting 1; 5 orphan workers
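To make the observations in either replication concrete, the running workers and their PID files can be checked with standard tools (assuming the default tmp/pids directory for the generated script; adjust if your setup differs):

# count running Delayed Job worker processes (the [d] trick excludes grep itself)
ps aux | grep '[d]elayed_job'
# list the PID files the Daemons gem is tracking
ls tmp/pids/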
REPLICATION INSTRUCTIONS FOR ONE ORPHAN (INCREASE WORKER COUNT)
Execute script/delayed_job -n 1 start
Observe there is 1 Delayed Job process running, as expected
Execute script/delayed_job -n 5 restart
Observe there are 6 Delayed Job processes running, when expecting 5; 1 orphan worker
ANALYSIS
The two examples above actually have different causes:
Increasing worker count causes an orphan worker due to the different naming convention for PID files when there is a single worker versus multiple (which has not changed since its introduction). But eliminating one naming convention (the single worker convention) would make orphans of all workers that currently exist under that convention. That would make this orphan worker problem worse, rather than better.
Decreasing worker count causes orphan worker(s) because Delayed Job only issues stop commands (via the Daemons gem) for the first n workers specified. Workers n+1 and beyond are allowed to continue running through future commands and are therefore orphaned.
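To illustrate the first cause, here is roughly what the PID directory looks like after the increase-count replication above (file names assume the default single-vs-multiple naming and the default tmp/pids location; exact names may vary by version). The unnumbered file belongs to the orphaned original worker:

ls tmp/pids/
# delayed_job.pid      <- original single worker, never stopped (orphan)
# delayed_job.0.pid
# delayed_job.1.pid
# delayed_job.2.pid
# delayed_job.3.pid
# delayed_job.4.pid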
PROPOSED SOLUTION
A solution to both causes: improve worker termination to avoid orphans. Specifically, restart should stop ALL workers (without regard for the n argument value) before starting n workers.
From a broader perspective, the proposed solution is that the delayed_job executable should NOT be like a worker management tool for starting and stopping arbitrary numbers of workers. It SHOULD be like a service that starts and stops all workers. That is how it makes the most intuitive sense. Plus, that is already how it works in the most basic context: start. Without any Delayed Job workers running, execute bin/delayed_job start twice. On the second invocation you encounter: "ERROR: there is already one or more instance(s) of the program running". (That error actually comes from the Daemons gem, but that's irrelevant.) This is the expected behavior of a service – if it is already running you can't start it again. If the delayed_job executable were a worker management tool then the second invocation would simply result in one more worker being started. Further, if this were a worker management tool, it should have a command to tell you how many workers are currently running, so you know how many to stop or restart, and it doesn't. This executable is clearly intended to behave like a service, but doesn't in the restart context (and others - see "FOLLOW-ON IMPROVEMENTS" below).
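As a rough sketch of the intended "stop everything first" behavior (and a manual workaround in the meantime), rather than the actual implementation in PR #1090, and assuming the default tmp/pids directory:

# Stop every worker whose PID file exists, regardless of how many were originally requested
for pidfile in tmp/pids/delayed_job*.pid; do
  [ -e "$pidfile" ] || continue
  kill -TERM "$(cat "$pidfile")" 2>/dev/null
  rm -f "$pidfile"
done
# Then start exactly the number of workers wanted going forward
script/delayed_job -n 1 start

Until restart behaves this way, something along these lines avoids orphans when changing worker counts.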
Please see PR #1090 for a proposed solution to the described issue with restart, which will also lay the foundation for follow-on improvements.
FOLLOW-ON IMPROVEMENTS
In addition to resolving the behavior of restart, these changes would also be required if the delayed_job executable is going to behave like a service:
start should raise an exception when any number of workers is running, not just the same number that was previously started (start then start raises an exception, but start then start -n 2 does not, again because of the different PID file naming conventions)
stop should stop all workers, not just the number specified by n. This is likely the fix for issue #212 ("Monitors do not shut down with script/delayed_job stop (with more than one worker running)").
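A quick illustration of the current start behavior described in the first item above (worker names follow the single-vs-multiple PID file convention; the exact error text may vary by Daemons version):

script/delayed_job start         # starts one worker, tracked by delayed_job.pid
script/delayed_job start         # refused: "ERROR: there is already one or more instance(s) of the program running"
script/delayed_job -n 2 start    # not refused: starts delayed_job.0 and delayed_job.1 alongside the original worker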
In another issue the culprit was orphaned DelayedJob processes that needed to be manually killed from the command line. The proposal in this issue would solve that problem more permanently.