TL;DR: let's modify the workers' idle-time config+logic so they can ping a service to determine whether they should shut down.
We're currently seeing slowness when spinning worker pools up and down. Whether or not we solve the worker-manager slowness itself, we can make the behavior of idle workers more efficient. If we address this before rewriting worker-manager, we'll spin workers up and down less often, mitigating the current slowness.
It appears that workers shut themselves down after n seconds of idleness, even if that reduces us below the minimum: docker-worker, generic-worker.
This means that if we have a minimum of 10 workers, and we have 10 workers running but idle, they'll all shut down and worker-manager will spin them back up every n seconds.
If spinning up workers were instantaneous, it would be best to shut them all down and only spin them up on demand. Since it can take minutes or sometimes even hours to spin up new workers, and since human time is more valuable than machine time, we should keep a minimum number of workers running in critical worker pools.
We've had people try to work around the worker-manager slowness issue by increasing the idle time, which works, but if we spin up 100 extra workers, all 100 extra workers will stick around for that extra idle time. Increasing the minimum worker capacity is even worse: we'll spin even more workers down and up every n seconds of idleness.
Proposal: if the workers query a central service ("Should I shut myself down on idle?"), that service can answer "no" to the minimum number of workers and "yes" to the rest. With some additional metadata, we can tell the oldest workers "yes" and keep the newest workers running. We may want the workers to ping the service every ~15 minutes so the service has a better idea of which workers are still running.
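To make the shape of that concrete, here's a rough sketch of the worker-side check in Go. The endpoint, payload, and field names are hypothetical placeholders for whatever we end up designing, not an existing worker-manager API:

```go
// Sketch of the worker-side check. The endpoint, payload, and field names
// are hypothetical -- nothing like this exists in worker-manager today.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type shutdownQuery struct {
	WorkerPoolID string    `json:"workerPoolId"`
	WorkerID     string    `json:"workerId"`
	Started      time.Time `json:"started"` // lets the service prefer shutting down the oldest workers
}

type shutdownAdvice struct {
	ShutDownOnIdle bool `json:"shutDownOnIdle"`
}

// shouldShutDownOnIdle asks the (hypothetical) central service whether this
// idle worker should terminate. On any error we fall back to today's
// behavior (shut down) rather than keeping the worker alive indefinitely.
func shouldShutDownOnIdle(serviceURL string, q shutdownQuery) bool {
	body, _ := json.Marshal(q)
	resp, err := http.Post(serviceURL+"/shutdown-advice", "application/json", bytes.NewReader(body))
	if err != nil {
		return true // can't reach the service: behave like today and shut down
	}
	defer resp.Body.Close()

	var advice shutdownAdvice
	if err := json.NewDecoder(resp.Body).Decode(&advice); err != nil {
		return true
	}
	return advice.ShutDownOnIdle
}

func main() {
	q := shutdownQuery{
		WorkerPoolID: "proj-example/small",
		WorkerID:     "i-0abc123",
		Started:      time.Now().Add(-2 * time.Hour),
	}
	fmt.Println("shut down on idle:", shouldShutDownOnIdle("https://example.invalid", q))
}
```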
We'll need to write both the worker config+logic and the service. Thoughts?
This seems like a good idea, but it will be important to think about it in the context of a distributed system, which worker-manager is. In particular, "to the minimum number" needs to be defined. There is basically no value that is known "right now" -- not even the total number of running workers. There's an approximation to that, updated after each provisioning loop, but it may be delayed by several minutes. So a simple "shut down if numWorkers > workerPool.minWorkers" rule will generally shut down too many workers.
I suspect this is a case where some degree of randomization is useful, so that if the pool is massively over-provisioned, then the shutdowns occur quickly, but if it is close to minWorkers, the shutdowns are less frequent so as not to shut down too many workers.
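For illustration only, here's one way to express that: scale the per-check shutdown probability by how far the (possibly minutes-stale) running-count estimate exceeds minWorkers, so a massively over-provisioned pool drains quickly while a pool near the floor sheds workers slowly enough for the next provisioning loop to catch up. The function and names are made up:

```go
// Illustrative only: a randomized shutdown decision the service could make
// per idle worker, using worker-manager's (possibly stale) estimate of
// running capacity rather than an exact count.
package main

import (
	"fmt"
	"math/rand"
)

// shutdownProbability returns the chance an individual idle worker should be
// told to shut down, given an approximate running count and the pool floor.
func shutdownProbability(approxRunning, minWorkers int) float64 {
	excess := approxRunning - minWorkers
	if excess <= 0 {
		return 0 // at or below the floor: keep everyone
	}
	// Fraction of the pool that appears to be surplus. Near minWorkers this
	// is small, so few workers shut down per interval and a stale estimate
	// can't drain the pool below the floor all at once.
	return float64(excess) / float64(approxRunning)
}

func main() {
	for _, running := range []int{200, 15, 11} {
		p := shutdownProbability(running, 10)
		fmt.Printf("approx running=%d  p(shutdown)=%.2f  decision=%v\n",
			running, p, rand.Float64() < p)
	}
}
```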
Note that worker-manager does know what workers are still running, both by polling the cloud provider's API and by continuing to get reregisterWorker calls from the workers. That was a big part of the rationale for reregisterWorker, in fact.
It may make sense to piggyback this should-I-shut-down information on the reregisterWorker calls, since those already occur regularly. A downside is that those calls are made by worker-runner, which doesn't currently know if the worker is idle or not. So you'll need to figure out how those two processes communicate that information as well -- whether the worker tells worker-runner its idle/active status (which can easily lead to race conditions) or worker-runner tells the worker whether to shut down when it is idle.
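If we go the second way (worker-runner tells the worker), here's a sketch of how that could flow, assuming a hypothetical `shutDownOnIdle` field added to the reregisterWorker response and a hypothetical message type in the runner-to-worker protocol; neither exists today:

```go
// Sketch of the "runner tells the worker" direction, which avoids the worker
// having to report idle/active state upward. The shutDownOnIdle field on the
// reregisterWorker response and the protocol message name are hypothetical.
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical extension of the reregisterWorker response body.
type reregisterResponse struct {
	Credentials    json.RawMessage `json:"credentials"`
	Expires        string          `json:"expires"`
	ShutDownOnIdle bool            `json:"shutDownOnIdle"` // not a real field today
}

// Hypothetical worker-runner -> worker message, in the spirit of the existing
// graceful-termination message: the worker stores the flag and consults it
// when its idle timer fires, so there is no request/response race.
type protoMessage struct {
	Type           string `json:"type"`
	ShutDownOnIdle bool   `json:"shutDownOnIdle"`
}

func adviceFromReregister(raw []byte) (protoMessage, error) {
	var resp reregisterResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return protoMessage{}, err
	}
	return protoMessage{Type: "shutdown-on-idle", ShutDownOnIdle: resp.ShutDownOnIdle}, nil
}

func main() {
	raw := []byte(`{"credentials": {}, "expires": "2024-01-01T00:00:00Z", "shutDownOnIdle": false}`)
	msg, err := adviceFromReregister(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("send to worker over the runner protocol: %+v\n", msg)
}
```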