
Orphan workers after restart with a different n value #1091

Open
dan-jensen opened this issue Jun 2, 2019 · 2 comments
PROBLEM
Delayed Job essentially orphans workers when you change the worker count from 1 to 2+, or vice versa, and restart. By that I mean Delayed Job allows certain workers to continue running through stop, start and restart commands. The exact reasons are explained below, but the fundamental problem is that the delayed_job executable sometimes behaves like a service and other times like a worker management tool.

SIGNIFICANCE
This problem is probably of moderate significance. Worker counts are not changed frequently. However, restart is the most common way of administering workers in production environments. More importantly, orphan workers can eat up memory, slow down servers, and even cause out-of-memory failures. So while it might not happen frequently, it can happen easily in production environments and be serious when it does.

REPLICATION INSTRUCTIONS FOR MULTIPLE ORPHANS (DECREASE WORKER COUNT)

  • Execute script/delayed_job -n 5 restart
  • Observe there are 5 Delayed Job processes running, as expected
  • Execute script/delayed_job -n 1 start
  • Observe there are 6 Delayed Job processes running, when expecting 1 (5 orphan workers)

REPLICATION INSTRUCTIONS FOR ONE ORPHAN (INCREASE WORKER COUNT)

  • Execute script/delayed_job -n 1 start
  • Observe there is 1 Delayed Job process running, as expected
  • Execute script/delayed_job -n 5 restart
  • Observe there are 6 Delayed Job processes running, when expecting 5 (1 orphan worker; see the check below)
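
For either replication above, a quick way to see the mismatch (assuming the default tmp/pids PID directory and that the worker processes keep "delayed_job" in their process titles) is:

    # count running Delayed Job worker processes
    ps aux | grep '[d]elayed_job' | wc -l

    # list the PID files the executable will act on for a given -n value
    ls tmp/pids/

In both scenarios the process count ends up higher than the last command's -n value, and the PID files of the leftover workers are never targeted by subsequent stop/restart invocations.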

ANALYSIS
The two examples above actually have different causes:

  1. Increasing the worker count causes an orphan worker because of the different PID file naming conventions used for a single worker versus multiple workers (unchanged since its introduction, and illustrated after this list). Simply eliminating the single-worker convention would orphan every worker that currently exists under it, making the problem worse rather than better.
  2. Decreasing the worker count causes orphan worker(s) because Delayed Job only issues stop commands (via the Daemons gem) for the first n workers specified. Worker n+1 and beyond continue running through future commands and are therefore orphaned.
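
For reference, this is the naming difference behind cause 1 (assuming the default tmp/pids PID directory):

    # script/delayed_job -n 1 start  =>  single-worker convention
    tmp/pids/delayed_job.pid

    # script/delayed_job -n 5 start  =>  numbered convention
    tmp/pids/delayed_job.0.pid
    tmp/pids/delayed_job.1.pid
    ...
    tmp/pids/delayed_job.4.pid

A stop or restart issued with a different -n only looks for the PID files matching its own naming scheme, so workers registered under the other scheme are never told to stop.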

PROPOSED SOLUTION
A solution to both causes: improve worker termination to avoid orphans. Specifically, restart should stop ALL workers (without regard for the n argument value) before starting n workers.
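
Until restart behaves that way, a blunt interim workaround is to stop workers under both naming schemes (and anything that slips through) before starting the new count. A rough sketch, assuming the previous count was 5 and the new count is 1; adjust the numbers to your setup and be careful that the pkill pattern doesn't match unrelated processes:

    # stop workers registered under the numbered and single-worker conventions
    script/delayed_job -n 5 stop
    script/delayed_job stop

    # last resort: terminate anything that slipped through
    pkill -f delayed_job

    # start the desired number of workers
    script/delayed_job -n 1 start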

From a broader perspective, the proposal is that the delayed_job executable should NOT behave like a worker management tool for starting and stopping arbitrary numbers of workers. It SHOULD behave like a service that starts and stops all workers. That is the most intuitive model, and it is already how the executable works in the most basic context: start. Without any Delayed Job workers running, execute bin/delayed_job start twice. On the second invocation you get: "ERROR: there is already one or more instance(s) of the program running". (That error actually comes from the Daemons gem, but that's beside the point.) This is the expected behavior of a service: if it is already running, you can't start it again. If the delayed_job executable were a worker management tool, the second invocation would simply start one more worker. A worker management tool would also need a command to report how many workers are currently running, so you know how many to stop or restart, and there is none. The executable is clearly intended to behave like a service, but it doesn't in the restart context (and others; see "FOLLOW-ON IMPROVEMENTS" below).

Please see PR #1090 for a proposed solution to the described issue with restart, which will also lay the foundation for follow-on improvements.

FOLLOW-ON IMPROVEMENTS
In addition to resolving the behavior of restart, these changes would also be required if the delayed_job executable is going to behave like a service:

@dan-jensen (Author)

In another issue the culprit was orphaned DelayedJob processes that needed to be manually killed from the command line. The proposal in this issue would solve that problem more permanently.

@cat5inthecradle

Seems related to #1172
