Background Processing

As a developer, I would like to be able to queue jobs for background processing. As an administrator, I should be able to view the job queue and manage jobs effectively as needed.

Owners

Richard Su (rwsu)

Current Status

Targeted Release: TBD

Screencast Demo

Show that the delayed_job and scheduler gems are available as RPMs
Display contents of the queues through the command line
Pause queue. Create instance. Restart queue and show instance is created, running and has correct status.
Force an error. Show alert is sent and multiple alerts are throttled.
RHEV instance lifecycle works when dbomatic has been decommissioned.
Pause queue. Adjust number of workers.
Remove a job from the queue.
Show/Manage queue through Conductor UI.

Implementation Tasks

* As a developer I’d like to have delayed_job back in our rails environment\

need version 3.0\
should leverage the init scripts and configure settings we once had\
setup two named queues, one to manage instance lifecycle and default one for all other jobs\
create two workers for each queue\
infrastructure in place for others to work on jobs for instance and deployment lifecycle\
gem packaging

* As an administrator I’d like to view the contents of the queues through the command line\

possibly an aeolus-job command, similar to aeolus-image?

* As a developer I’d like to have a scheduling tool in our rails environment that works with delayed_job\

decide on which scheduler to use. rufus-scheduler is one possibility\
gem packaging

* As a developer I want to move instance status checking from dbomatic into a background job\

the scheduler would queue a job for each provider account\
the scheduler should not duplicate jobs if one is in queue or is running\
unit tests - figure out how to test scheduling\
remove bits from dbomatic
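
The "one job per provider account, no duplicates" rule can be sketched in plain Ruby. This is a minimal sketch, not the Conductor implementation: `StatusCheckScheduler` and its methods are hypothetical names, and the in-memory array and set stand in for whatever the job table would record about queued and running jobs.

```ruby
require 'set'

# Hypothetical in-memory stand-in for the job table; a real setup would
# look up queued and running jobs in the database instead.
class StatusCheckScheduler
  attr_reader :queue

  def initialize
    @queue = []          # account ids with a job waiting for a worker
    @running = Set.new   # account ids whose job is currently running
  end

  # Called on every scheduler tick: queue one job per provider account,
  # skipping accounts that already have a queued or running job.
  def tick(account_ids)
    account_ids.each do |id|
      next if @queue.include?(id) || @running.include?(id)
      @queue << id
    end
  end

  # A worker picks up the oldest queued job.
  def start_next
    id = @queue.shift
    @running << id if id
    id
  end

  # A worker reports that the job for this account has finished.
  def finish(id)
    @running.delete(id)
  end
end

scheduler = StatusCheckScheduler.new
scheduler.tick([1, 2])   # queues jobs for accounts 1 and 2
scheduler.start_next     # the job for account 1 is now running
scheduler.tick([1, 2])   # no duplicates: 1 is running, 2 is still queued
```

Because the scheduler is driven by an explicit `tick` call, a unit test can exercise the scheduling logic without waiting on a real timer.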

* As a developer I want to move realm status checking from dbomatic into a background job\

same as above

* As an Administrator I should be alerted when an error occurs in a background processing job\

send email when there is an error, throttle emails of the same type\
job should be configured to retry once. 25 retries is the default\
warning alert sent with first failure and error alert sent if retry fails?\
integration with Conductor UI. make errors visible in Conductor UI
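
Throttling emails of the same type could look roughly like the following. It is a sketch under assumptions: `AlertThrottle`, the error type strings, and the 10-minute window are all made up for illustration, and the caller is expected to send the email only when `allow?` returns true.

```ruby
# Hypothetical throttle: at most one alert per error type per time window.
class AlertThrottle
  def initialize(window_seconds)
    @window = window_seconds
    @last_sent = {}   # error type => time the last alert was sent
  end

  # Returns true when an alert should be sent, false when it is throttled.
  def allow?(error_type, now = Time.now)
    last = @last_sent[error_type]
    return false if last && (now - last) < @window
    @last_sent[error_type] = now
    true
  end
end

throttle = AlertThrottle.new(600)   # 10-minute window, an arbitrary choice
t0 = Time.at(0)
sent = [
  throttle.allow?('ProviderTimeout', t0),        # first alert goes out
  throttle.allow?('ProviderTimeout', t0 + 60),   # same type, inside the window
  throttle.allow?('AuthFailure', t0 + 60),       # different type, not throttled
  throttle.allow?('ProviderTimeout', t0 + 601)   # window has passed
]
```

Passing the current time as an argument keeps the throttle easy to test without sleeping.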

* As a Developer, when I create a RHEV instance, the system should enqueue a job to start it\

replaces RHEV instance start code in dbomatic\
start can only happen once the instance has gone through the NEW -> PENDING -> STOPPED states

* As a Developer, I should not see dbomatic code in Conductor and Configure.

* As an Administrator I want to stop workers and adjust the number of workers per queue

* As an Administrator I want to remove jobs from the queues

* As an Administrator I would like to pause the scheduler\

Stopping workers may achieve the same end result.\
Scheduler would only queue at most one job of each type.

* As an Administrator I would like to view and manage the queues with the Conductor UI

Candidate Solutions

The two most common solutions are delayed_job and resque. GitHub has a good write-up comparing background processing solutions and explaining why they first settled on delayed_job and later moved to resque: https://github.com/blog/542-introducing-resque.

The primary differences between delayed_job and resque are:

At the moment, delayed_job doesn’t support recurring jobs. resque does support recurring jobs through the resque-scheduler extension/gem.

resque provides a sinatra app to monitor the queue. delayed_job doesn’t provide monitoring tools out of the box, but we could potentially build something on top of rails or simply look at the contents of the database table.

resque requires multiple components and could be more difficult to support. Recurring jobs require a second gem, resque-scheduler. It also uses Redis as its backend, which is currently not available in RHEL. This may be the deal breaker.

Requirements and Fit with Candidate Solutions

\1. Bucket jobs into different queues. A long-running job that checks instance status for 1000 instances should not hold up other jobs. The solution should also support multiple workers, which would minimize the impact of long-running jobs, but using different queues offers finer-grained control.

delayed_job: supports multiple queues through named queues starting with version 3.0. Can start up multiple workers for all queues or for specific queues.
resque: supports multiple queues and workers.

\2. Jobs should persist in some way. If a crash occurs, we should be able to restart the system and continue with processing incomplete jobs in the queue.

delayed_job: Jobs persist as serialized objects stored in ActiveRecord entries.
resque: Jobs persist as JSON objects in Redis entries. Storing JSON rather than actual objects, whose classes may have advanced to a different version, potentially makes updating the application easier.

\3. Recurring jobs.

delayed_job: Not available, in development.
resque: Through resque-scheduler extension.
whenever: A potential alternative to do cron style scheduling [6].

\4. Alerts. Failures should be presented to the user in some way (email, conductor UI) so that appropriate actions can be taken.

delayed_job: Supports code hooks for different stages in the process. Hooks can be added for error, failure, and success. By default, workers will retry a job 25 times. We should use a lower number: there is no sense in retrying that many times and holding up the queue if there is a hard failure somewhere in the system. By default it also deletes failed jobs, but it can be configured to leave them in the queue with a flag to indicate failure.
resque: Failed jobs can go through additional processing using different failure backends (redis, syslog, custom, etc.).
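
As a rough illustration of the delayed_job side, the sketch below defines a custom job with `error` and `failure` hooks and a lowered attempt limit. The hook names and the per-job `max_attempts` method follow the delayed_job README as best understood here, and should be checked against the packaged version. `InstanceStatusJob`, the `alerts` array, and the small driver loop are hypothetical, and the driver only imitates a worker so that the sketch runs without the gem installed.

```ruby
# Sketch of a custom job. delayed_job treats any object with a `perform`
# method as a job and calls the optional hooks below when they are defined.
class InstanceStatusJob
  attr_reader :alerts

  def initialize(account_id)
    @account_id = account_id
    @alerts = []   # stand-in for email / Conductor UI notifications
  end

  def perform
    raise 'provider unreachable'   # forced failure for the demonstration
  end

  # Cap on attempts for this job: the first run plus one retry.
  def max_attempts
    2
  end

  # Hook: called each time an attempt raises.
  def error(job, exception)
    @alerts << [:warning, exception.message]
  end

  # Hook: called once when the job has used up its attempts.
  def failure(job)
    @alerts << [:error, "status check failed for account #{@account_id}"]
  end
end

# Hypothetical driver imitating a worker's retry loop.
job = InstanceStatusJob.new(42)
job.max_attempts.times do
  begin
    job.perform
  rescue StandardError => e
    job.error(nil, e)
  end
end
job.failure(nil)
```

In this sketch every failed attempt produces a warning and the final failure produces an error alert. Whether the warning should go out only on the first failure is still an open question.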

\5. A mechanism to requeue a failed job once the underlying issue has been resolved. For example, if instance start jobs fail because of a network failure to a provider, we should be able to requeue those jobs once the network is back online. Not sure whether this should be automated or whether there should be a button somewhere that lets a user manually requeue all or selected failed jobs.

custom
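
One possible shape for the custom requeue logic, assuming failed jobs are kept in the table rather than deleted: reset the failure markers on the chosen rows so that a worker picks them up again. `JobRow` and `requeue_failed` are hypothetical. The struct fields mirror delayed_job's `attempts`, `failed_at`, `last_error` and `run_at` columns, and a real version would update database rows instead of structs.

```ruby
# Struct standing in for a row of the job table.
JobRow = Struct.new(:id, :attempts, :failed_at, :last_error, :run_at)

# Reset the selected failed jobs so that a worker picks them up again.
# Passing no ids requeues every failed job.
def requeue_failed(rows, ids = nil, now = Time.now)
  failed = rows.select { |r| r.failed_at && (ids.nil? || ids.include?(r.id)) }
  failed.each do |r|
    r.attempts = 0
    r.failed_at = nil
    r.last_error = nil
    r.run_at = now
  end
end

rows = [
  JobRow.new(1, 2, Time.at(100), 'network unreachable', Time.at(50)),
  JobRow.new(2, 0, nil, nil, Time.at(60)),   # still pending, left alone
  JobRow.new(3, 2, Time.at(110), 'network unreachable', Time.at(55))
]
requeued = requeue_failed(rows, [1], Time.at(200))   # requeue job 1 only
```

The optional `ids` argument covers both options mentioned above: a manual "requeue selected" action passes specific ids, while an automated or "requeue all" action passes none.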

\6. Monitor job status. We should have some way to see what is in the queue.

delayed_job: Can only view the queue through the ActiveRecord database entries. There is no UI, so it is more difficult to see what is going on.
resque: Provides a sinatra app to monitor queues, jobs, and workers.

\7. Should not enqueue duplicate jobs.

custom

\8. Ability to remove jobs from the queues and to place a pause on the queues or jobs.

custom

\9. Supportable in Fedora and RHEL

delayed_job: We used it in the past. Will need to carry the gem.
resque: Will need to carry the gem. In addition, it requires Redis as the backend. Redis is available in Fedora but not in RHEL. Redis is an open source project sponsored by VMware [4].

Use Cases

\1. Dbomatic replacement for instance and realm checking and RHEV instance start.

Each RHEV instance that is created will also lead to a job that is enqueued to start that instance.
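
A minimal sketch of that pairing is shown below. The `Instance` struct stands in for Conductor's instance model, `StartRhevInstanceJob` and `create_rhev_instance` are hypothetical names, and the array stands in for the instance lifecycle queue. The job only acts once the instance has reached STOPPED, matching the NEW -> PENDING -> STOPPED precondition in the implementation tasks.

```ruby
# Stand-in for Conductor's instance model.
Instance = Struct.new(:name, :state)

# Hypothetical job object; delayed_job runs any object that responds to
# `perform`.
StartRhevInstanceJob = Struct.new(:instance) do
  def perform
    # RHEV instances may only be started once they have reached STOPPED.
    return false unless instance.state == 'STOPPED'
    instance.state = 'RUNNING'   # placeholder for the real provider call
    true
  end
end

queue = []   # stand-in for the instance lifecycle queue

# Creating an instance also enqueues the job that will start it.
def create_rhev_instance(name, queue)
  instance = Instance.new(name, 'NEW')
  queue << StartRhevInstanceJob.new(instance)
  instance
end

instance = create_rhev_instance('web01', queue)
too_early = queue.first.perform   # still NEW, so nothing happens
instance.state = 'STOPPED'        # after NEW -> PENDING -> STOPPED
started = queue.first.perform
```

In a real worker a job that runs too early would need to be retried or rescheduled rather than silently returning false. That policy is left open here.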

Create a new job to perform instance status check. Create a status check job for each provider account. Allow status check job to be disabled/enabled per provider account.

Create a new job to sync realms for all providers. This can be broken up to a job per provider if needed.

Create two queues: one for managing instance lifecycle and a second for all other jobs. Start with two workers per queue. Make the number of workers configurable so that it can be adjusted when needed.
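
The queue and worker layout can be modelled with Ruby threads to show the intent: each queue has its own configurable pool, so a slow instance job never delays jobs in the default queue. This is only a model. The threads stand in for delayed_job worker processes, and `WORKERS_PER_QUEUE` and the job lambdas are hypothetical.

```ruby
require 'thread'

# Worker counts per queue; in a real deployment this would come from a
# configuration file rather than a constant.
WORKERS_PER_QUEUE = { 'instance' => 2, 'default' => 2 }

queues  = Hash.new { |h, name| h[name] = Queue.new }
results = Queue.new   # thread-safe collection of finished job labels

# Start the configured number of workers for each queue. Each worker only
# reads from its own queue and exits when it receives :stop.
threads = WORKERS_PER_QUEUE.flat_map do |name, count|
  queue = queues[name]
  Array.new(count) do
    Thread.new do
      while (job = queue.pop) != :stop
        results << job.call
      end
    end
  end
end

queues['instance'] << -> { sleep 0.05; 'instance status check' }
queues['default']  << -> { 'ldap sync' }

# Shut down: one :stop marker per worker, then wait for all of them.
WORKERS_PER_QUEUE.each { |name, count| count.times { queues[name] << :stop } }
threads.each(&:join)

finished = []
finished << results.pop until results.empty?
```

Changing a value in `WORKERS_PER_QUEUE` is the only step needed to resize a pool in this model, which is the kind of adjustment the administrator stories ask for.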

\2. LDAP syncing

\3. Generic instance start and stop