Robust_instance_launching

Matt Wagner edited this page Oct 25, 2012 · 1 revision

Robust instance launching

Summary

This page describes the launch process for multi-instance deployments.

Owner

Jan Provaznik ([email protected])

Current status

  • Targeted release:
  • Last update:

When a deployment is launched, a deployment object is created and saved. Then the ‘launch’ method is called on this deployment, which creates the required instances in the conductor DB and associates them with the deployment object. It then tries to find a suitable ‘match’ (a combination of hwp, provider account and realm) where all instances of this deployment can be launched. If a match is found, launch params are computed for all instances. Finally we iterate through all instances and try to launch them. If any instance launch fails, we set the create_failed state on this instance and continue with the next one.
None of the above steps run in a transaction, IOW if a match is not found, or the launch params upload fails, or an instance launch fails, the deployment and instances stay created. There is no retry or fallback plan if an error occurs (for example when the provider of the chosen match is not accessible).

Screencast Demo

  • Launch fails on both providers

    • Two of the deployment’s instances launch, the third instance fails to launch
    • Launch on the first account should be rolled back - the two launched instances should be stopped
    • Same for the second account
    • Deployment should be destroyed
    • A log entry for this launch should be created
  • Launch on first provider account fails, succeeds on second provider account

    • Two of the deployment’s instances launch, the third instance fails to launch
    • Launch on the first account should be rolled back - the two launched instances should be stopped
    • Launch should then be done on the second account and should be successful

Implementation tasks

Tasks which were already in Redmine cover the whole deployment launch process:
#3060 - Refactor the launch process to include better error reporting, retries, switching to alternate providers etc.
#3061 - Ensure that the UI doesn’t contain unlaunched instances
#3062 - Ensure that multi-instance deployments always launch fully or not at all. Conductor should automatically clean-up partial deployments

These might be restated as the following tasks:
* choose and integrate a tool for running background jobs (see “Background job” section)
* extend deployment model with “state” attribute (see “High-level implementation details” section)
* update deployment launch process to not create a deployment if launch prerequisites are not met
* update deployment launch process so that a deployment is launched fully or not at all, make this option configurable
* update deployment launch process to support instance launch retry if an error occurs
* update deployment launch UI to display current launch state/progress

Detailed description

The whole deployment launch process has 3 phases:
# pre-launch: a deployment object and instance objects are created in conductor db, we check whether the deployment can be launched somewhere and whether there are any deadlocks in instance params (see “Instance params dependencies” below). If everything is OK, a background job for launching the deployment is enqueued. If anything goes wrong, we just call rollback, nothing is saved and the user stays on the launch page.
# launch: done in the background, instance launch params are uploaded to the config server and a dc-api create instance request is sent for each instance
# rollback/relaunch on instance state change (optional): if an instance launch fails for some reason, we retry launching the instance. If that doesn’t help, we try to deploy somewhere else: stop all instances which have already been launched, then find another match (skipping all matches which failed) and reset state to NEW for all instances (or drop and recreate them). Both the number of retries before rollback and rollback itself are configurable.

*launch progress page (TBD)*
Angus suggested that there could be something like a “launch progress page” where details of what’s being done with the deployment would be shown. So if the user checks a “show me details” checkbox before clicking the “launch” button, they are redirected to this progress page, which displays the step currently being done:
“Selecting provider account… account_name”
“Making launch request for instance… x”

This could probably just display all events associated with this deployment.
Showing this page would be optional, alternatively it could be part of the deployment’s show page, where a user could be redirected after launch.

High-level implementation details

Add ‘state’ attribute to Deployment model, states can be:
* new - deployment is created in Conductor DB, but no instance has been launched yet
* pending - at least one instance launch has been requested
* failed - final state, deployment launch/shutdown failed
* rollback_in_progress - an error occurred during launching an instance and there are already some launched instances which have to be stopped
* rollback_failed - stopping of already launched instances failed
* rollback_complete - stopping of already launched instances finished, now the deployment can be launched somewhere else
* running - all instances are in running state
* incomplete - some instances are not running
* shutting_down - shutdown was initiated
* stopped - all instances are stopped

Allowed state transitions:
* new -> pending
* pending -> running|rollback_in_progress|failed
* rollback_in_progress -> rollback_complete|rollback_failed
* rollback_complete -> pending|failed
* running -> shutting_down|incomplete
* incomplete -> running|shutting_down
* shutting_down -> stopped

Deployment state will be used to track the deployment’s history and to decide what to do on a change - for example, if the last instance of a deployment is stopped, deployment relaunch is done only if the deployment was in rollback_in_progress state, otherwise the deployment stays stopped.
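
The allowed transitions listed above could be enforced with a simple lookup table. The following is only a minimal sketch: the class name `DeploymentState`, the `InvalidTransition` error and the method names are hypothetical, there is no ActiveRecord involved, and the state names are spelled with underscores (for example `rollback_in_progress`), which is an assumption about the intended identifiers.

```ruby
# Hypothetical sketch: allowed deployment state transitions as a lookup table.
ALLOWED_TRANSITIONS = {
  'new'                  => %w[pending],
  'pending'              => %w[running rollback_in_progress failed],
  'rollback_in_progress' => %w[rollback_complete rollback_failed],
  'rollback_complete'    => %w[pending failed],
  'running'              => %w[shutting_down incomplete],
  'incomplete'           => %w[running shutting_down],
  'shutting_down'        => %w[stopped]
}.freeze

class InvalidTransition < StandardError; end

# Minimal stand-in for the Deployment model (plain Ruby, no database).
class DeploymentState
  attr_reader :state, :history

  def initialize
    @state = 'new'
    @history = ['new']   # every state the deployment has been in, in order
  end

  # True if the transition from the current state is in the table.
  def can_transition_to?(new_state)
    ALLOWED_TRANSITIONS.fetch(@state, []).include?(new_state)
  end

  # Performs the transition or raises InvalidTransition.
  def transition_to!(new_state)
    unless can_transition_to?(new_state)
      raise InvalidTransition, "#{@state} -> #{new_state} is not allowed"
    end
    @state = new_state
    @history << new_state
    self
  end
end

deployment = DeploymentState.new
deployment.transition_to!('pending').transition_to!('rollback_in_progress')
puts deployment.can_transition_to?('running')   # not in the table for this state
```

Keeping the history alongside the current state is one way to answer questions such as "was this deployment in rollback before its last instance stopped?" without adding more states.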

State will also be used in the UI for displaying the deployment’s state - currently we use only 3 states (pending, running and failed), and these are computed “per request” by checking the state of all instances in the deployment.

deployment_launch:
  in transaction do
    create deployment
    create deployment’s instances
    compute instances dependencies (covered by task 3054)
    find match where all instances can be launched (covered by task 3064)
    invoke instances_launch (on background)
  on error:
    deployment and instances are not created in conductor’s db
    user stays on deployment launch page
    proper error with reason why launch was not successful is displayed


instances_launch:
  for each deployment’s instance do
    check quota
    send dc api launch request
  on error:
    initiate deployment rollback


instance’s after update callback:
  if the instance is in failed state, try X retries in reasonable interval, if it's still not successful then invoke deployment_rollback


deployment_rollback:
  if all instances are stopped/failed invoke deployment_relaunch
  else send stop request to any instances in pending or running state


deployment_relaunch:
  find new match where all instances can be launched (skipping matches which we tried before)
  if match is found, invoke instances_launch
  else create log about failed launch in some history log (covered by scenario 3037) and destroy this deployment
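
The relaunch step above (find a new match, skipping matches tried before) could be sketched in plain Ruby as follows. This is only an illustration under assumptions: `Match`, `RelaunchPlanner` and `launch_with_fallback` are hypothetical names, the launcher is modelled as a synchronous callable returning true or false (the real launch is asynchronous and reports failures through instance state changes), and the history log is just an array.

```ruby
# Hypothetical sketch of deployment_relaunch: pick the next match that has
# not been tried yet, or give up when none is left.
Match = Struct.new(:provider_account, :hwp, :realm)

class RelaunchPlanner
  attr_reader :tried

  def initialize(candidates)
    @candidates = candidates   # all matches where every instance could run
    @tried = []                # matches that were already attempted
  end

  # Returns the next untried match and remembers it, or nil if exhausted.
  def next_match
    match = @candidates.find { |m| !@tried.include?(m) }
    @tried << match if match
    match
  end
end

# launcher: any callable that returns true when the launch on a match succeeds.
# log: an array standing in for the history log.
def launch_with_fallback(planner, launcher, log)
  while (match = planner.next_match)
    return match if launcher.call(match)
    log << "launch failed on #{match.provider_account}, rolling back"
  end
  log << 'no suitable match left, deployment would be destroyed'
  nil
end

candidates = [
  Match.new('account-one', 'small', 'realm-a'),
  Match.new('account-two', 'small', 'realm-b')
]
log = []
only_second_works = ->(m) { m.provider_account == 'account-two' }
chosen = launch_with_fallback(RelaunchPlanner.new(candidates), only_second_works, log)
puts chosen.provider_account
puts log
```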

Instance params dependencies
Because of instance launch-time params, there can be dependencies between instances. The good news is that Audrey’s Configserver will handle this itself. All instances can be launched immediately and Configserver will make sure that any services which depend on values from other instances are launched after these values are available. More details about instance dependencies will be on a separate page (scenario https://www.aeolusproject.org/redmine/issues/3035). In short: conductor doesn’t care about launch order - it just launches all instances at once.

Instance launch retries
If an error occurs when launching an instance, it makes sense to try relaunching the instance first before doing a rollback - especially if it’s a deployment with multiple instances and some of them are already running. Retrying might not be the right choice in all situations because:
* retry won’t help in many error situations
* each retry means another delay in launch process, some users may prefer to not waste time by retrying especially if they have more than one provider which they can use

So this retry option would ideally be configurable by the user, and its implementation could be split into these tasks:

  1. extend deployment model with an attribute to remember the number of remaining retries
  2. invoke instance relaunch from dbomatic in a constant interval (1 min) for instances which have number of remaining retries > 0 (and decrease this number with each retry)
  3. add a “number of retries” input field to the deployment launch page, with a reasonable default value (2 retries?)
    TODO: maybe it will be handy to add a new instance state for instances which failed but will be retried, not sure about this yet.
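
The retry bookkeeping from the tasks above could look roughly like this. It is a sketch only: `RetryTracker` and its methods are hypothetical names, the default of 2 retries and the 1 minute interval are the tentative values mentioned in the text, and time is passed in explicitly so the logic can be checked without a running daemon.

```ruby
# Hypothetical sketch of the "remaining retries" bookkeeping.
DEFAULT_RETRIES = 2    # tentative default from the text
RETRY_INTERVAL  = 60   # seconds; the text proposes a constant 1 min interval

class RetryTracker
  attr_reader :remaining

  def initialize(retries = DEFAULT_RETRIES)
    @remaining = retries
    @last_attempt_at = nil
  end

  # True when a failed instance should be relaunched at time `now`
  # (`now` may be a Time or a number of seconds).
  def retry_due?(now)
    return false if @remaining <= 0
    @last_attempt_at.nil? || (now - @last_attempt_at) >= RETRY_INTERVAL
  end

  # Records one relaunch attempt and decreases the remaining count.
  def record_retry(now)
    @remaining -= 1
    @last_attempt_at = now
  end

  # No retries left: the caller would initiate deployment rollback.
  def exhausted?
    @remaining <= 0
  end
end

tracker = RetryTracker.new
now = Time.now
if tracker.retry_due?(now)
  tracker.record_retry(now)   # the relaunch request would be sent here
end
puts tracker.remaining
```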

Rollback
Rollback should be optional too (IOW, it should be possible to launch an incomplete deployment if a user wants that). TBD
Probably a checkbox on the launch page.

Instance launch timeout
During deployment launch, if an instance stays in pending state for X minutes, the launch is terminated and deployment rollback is initiated.
This timeout should be configurable, default timeout could be 15 minutes?
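
A periodic check for this timeout might be sketched as below. The names `PendingInstance` and `timed_out_instances` are hypothetical, the 15 minute default is the tentative value from the text, and how "pending since" would actually be stored is an assumption.

```ruby
# Hypothetical sketch of the launch timeout check.
LAUNCH_TIMEOUT = 15 * 60   # seconds; tentative 15 minute default

PendingInstance = Struct.new(:name, :state, :pending_since)

# Returns the instances that have been pending longer than the timeout.
# A non-empty result would trigger deployment rollback.
def timed_out_instances(instances, now, timeout = LAUNCH_TIMEOUT)
  instances.select do |i|
    i.state == 'pending' && (now - i.pending_since) > timeout
  end
end

start = Time.now - 20 * 60
instances = [
  PendingInstance.new('frontend', 'pending', start),
  PendingInstance.new('backend', 'running', start)
]
puts timed_out_instances(instances, Time.now).map(&:name)
```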

Background job
Update: background processing is covered by a separate scenario, see Background Processing
A background job will be used for launching instances. Even just sending launch requests synchronously for a multi-instance deployment has the following disadvantages:
* may take a long time, a “connection timeout” error may be displayed
* the user is blocked for this time and sees only a “loading” cursor
* we can’t inform the user about launch progress/what’s being done

There are plenty of background job tools for ruby/rails. Choosing a suitable one is a task of this scenario. I would like to push it a little bit further:
the dbomatic daemon might be replaced by this bg job tool. There is no reason to keep 2 things which do similar work. I briefly searched for the most widely used tools and here are 2 examples:
# Delayed Job:
*** pros: we used this before, it worked fine, it’s packaged in fedora
*** cons: doesn’t support recurring jobs: https://github.com/collectiveidea/delayed_job/wiki/FEATURE:-Adding-Recurring-Job-Support-to-Delayed_Job A workaround is to re-enqueue the job every time it’s executed
# Resque Scheduler:
*** pros: supports both recurring and single jobs, seems to work fine
*** cons: not packaged in fedora, dependency on external service (redis)

I’m not saying we have to use one of the above, they are just an inspiration. But I’d prefer a bg tool which supports recurring jobs “natively” so it could be a dbomatic replacement.
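
The re-enqueue workaround mentioned for Delayed Job can be illustrated with a tiny in-memory queue. To be clear, this is not the Delayed Job API: `TinyQueue` and `PollInstancesJob` are hypothetical stand-ins that only show the pattern of a job scheduling its own next run.

```ruby
# Hypothetical in-memory queue; this is NOT the Delayed Job API.
class TinyQueue
  Entry = Struct.new(:run_at, :job)

  def initialize
    @entries = []
  end

  def enqueue(job, run_at)
    @entries << Entry.new(run_at, job)
  end

  # Runs every job whose run_at is not later than `now`,
  # returns the number of jobs that were run.
  def work_off(now)
    due, @entries = @entries.partition { |e| e.run_at <= now }
    due.sort_by(&:run_at).each { |e| e.job.perform(self, e.run_at) }
    due.size
  end
end

# A job that emulates recurrence by enqueueing its next run when it executes.
class PollInstancesJob
  attr_reader :runs

  def initialize(interval)
    @interval = interval
    @runs = []
  end

  def perform(queue, scheduled_at)
    @runs << scheduled_at   # placeholder for the real polling work
    queue.enqueue(self, scheduled_at + @interval)
  end
end

queue = TinyQueue.new
job = PollInstancesJob.new(60)
queue.enqueue(job, 0)
queue.work_off(0)
queue.work_off(60)
puts job.runs.inspect
```

The drawback of this pattern is that the recurrence lives inside the job itself: if one run raises before re-enqueueing, the chain stops, which is one reason a tool with native recurring jobs may be preferable.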

Future plan

The above is a short/mid-term solution for improving instance launching, it doesn’t add any new dependency/tool. The long-term solution is to integrate Heat (https://github.com/heat-api), which is expected to do all the things we need (take care of deps between instances, launch instances in proper order, rollback of failed launch, monitoring…).

We don’t care about dependencies between instances when stopping a deployment.

References

Links and other references related to the feature.
Mails, IRC logs, documentation for libraries used, links to other parts of project documentation, etc.
