Robust instance launching
This page describes the multi-instance deployment launch process.
Jan Provaznik ([email protected])
- Targeted release:
- Last update:
When launching a deployment, a deployment object is created and saved.
Then the ‘launch’ method is called on this deployment, which creates the
required instances in the Conductor DB and associates them with the
deployment object. It then tries to find a suitable ‘match’ (a
combination of hwp, provider account and realm) where all instances of
the deployment can be launched. If a match is found, launch params are
computed for all instances. Finally we iterate through all instances and
try to launch them. If any instance launch fails, we set the
create_failed state on that instance and continue with the next one.
None of the above steps run in a transaction; in other words, if no
match is found, the launch params upload fails or an instance launch
fails, the deployment and instances stay created. There is no retry or
fallback plan if an error occurs (for example when the provider of the
chosen match is not accessible).
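The current behaviour can be condensed into a short runnable sketch. Deployment, Instance, and the matcher/launcher callables below are illustrative stand-ins, not real Conductor classes; the point is the shape of the flow: one match for everything, then a launch loop that records failures and simply moves on.

```ruby
# Illustrative stand-ins for the Conductor models (not real code).
class Instance
  attr_reader :name
  attr_accessor :state

  def initialize(name)
    @name = name
    @state = :new
  end
end

class Deployment
  attr_reader :instances, :errors

  def initialize(instance_names)
    @instances = instance_names.map { |n| Instance.new(n) }
    @errors = []
  end

  # Mirrors the current flow: find one match for all instances, then
  # launch each in turn; a failed instance is marked create_failed and
  # the loop simply continues - no rollback, no retry.
  def launch(matcher, launcher)
    match = matcher.call(instances)   # hwp + provider account + realm
    return @errors << "no match found" unless match

    instances.each do |inst|
      begin
        launcher.call(inst, match)    # stands in for the real launch request
        inst.state = :pending
      rescue StandardError => e
        inst.state = :create_failed   # failure recorded, loop continues
        @errors << e.message
      end
    end
  end
end
```

Note that nothing here undoes earlier work: when the launcher raises for one instance, the others stay pending and the deployment object survives, which is exactly the gap this page proposes to close.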
Launch fails on both providers
- Two of the deployment’s instances launch, the third instance fails to launch
- The launch on the first account should be rolled back - the two launched instances should be stopped
- The same happens on the second account
- The deployment should be destroyed
- A log entry for this launch should be created
Launch on the first provider account fails, succeeds on the second provider account
- Two of the deployment’s instances launch, the third instance fails to launch
- The launch on the first account should be rolled back - the two launched instances should be stopped
- The launch should then be retried on the second account and should succeed
Tasks already in Redmine cover the whole deployment launch process:
#3060 - Refactor the launch process to include better error reporting,
retries, switching to alternate providers etc.
#3061 - Ensure that the UI doesn’t contain unlaunched instances
#3062 - Ensure that multi-instance deployments always launch fully or
not at all. Conductor should automatically clean-up partial deployments
Though it might be rewritten as:
* choose and integrate a tool for running background jobs (see
“Background job” section)
* extend deployment model with “state” attribute (see “high level
implementation” section)
* update deployment launch process to not create a deployment if launch
prerequisites are not met
* update deployment launch process to be launched fully or not at all,
make this option configurable
* update deployment launch process to support instance launch retry if
an error occurs
* update deployment launch UI to display current launch state/progress
The whole deployment launch process has 3 phases:
# pre-launch: a deployment object and instance objects are created in
the Conductor DB, and we check whether the deployment can be launched
somewhere and whether there are any deadlocks in instance params (see
“instance params dependencies” below). If everything is OK, a background
job for launching the deployment is enqueued. If anything goes wrong, we
just call rollback, nothing is saved and the user stays on the launch
page.
# launch: done in the background; instance launch params are uploaded to
the config server, then a dc-api create-instance request is sent for
each instance
# rollback/relaunch on instance state change (optional): if an instance
launch fails for some reason, we retry launching the instance; if that
doesn’t help, we try to deploy somewhere else: stop all instances which
have already been launched, then find another match (skipping all
matches which failed) and reset the state to NEW for all instances (or
drop and recreate them). Both the number of retries before rollback and
the rollback itself are configurable.
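The pre-launch phase can be sketched with toy in-memory stand-ins for the database transaction and the job queue. FakeDB, FakeQueue and pre_launch below are assumptions for illustration, not Conductor code; the key property shown is that any failed check rolls the transaction back, so nothing is persisted and the user only sees the error.

```ruby
class FakeDB
  attr_reader :deployments

  def initialize
    @deployments = []
  end

  # Minimal transaction: on any error, restore the previous state.
  def transaction
    snapshot = @deployments.dup
    yield
  rescue StandardError
    @deployments = snapshot
    raise
  end

  def create_deployment(name)
    @deployments << name
    name
  end
end

class FakeQueue
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def enqueue(job, payload)
    @jobs << [job, payload]
  end
end

# Phase 1 (pre-launch): create records, validate, enqueue the background
# launch. Any failure rolls the transaction back, so nothing is saved
# and the user stays on the launch page with the reason displayed.
def pre_launch(db, queue, name, match_found:)
  db.transaction do
    deployment = db.create_deployment(name)
    raise "no match found" unless match_found   # stand-in for the real checks
    queue.enqueue(:launch_deployment, deployment)
  end
  :enqueued
rescue StandardError => e
  e.message
end
```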
*launch progress page (TBD)*
Angus suggested that there could be something like a “launch progress
page” where details of what’s being done with the deployment would be
shown. If the user checks a “show me details” checkbox before clicking
the “launch” button, he is redirected to this progress page, which
displays the step currently being done:
“Selecting provider account… account_name”
“Making launch request for instance… x”
…
This could probably just display all events associated with the
deployment. Showing this page would be optional; alternatively it could
be part of the deployment’s show page, to which the user could be
redirected after launch.
Add a ‘state’ attribute to the Deployment model; states can be:
* new - the deployment is created in the Conductor DB, but no instance
has been launched yet
* pending - at least one instance launch has been requested
* failed - final state, deployment launch/shutdown failed
* rollbackinprogress - an error occurred while launching an instance and
there are already some launched instances which have to be stopped
* rollbackfailed - stopping of the already launched instances failed
* rollbackcomplete - the already launched instances were stopped; the
deployment can now be launched somewhere else
* running - all instances are in the running state
* incomplete - some instances are not running
* shutting_down - shutdown was initiated
* stopped - all instances are stopped
Allowed state transitions:
* new -> pending -> running | rollbackinprogress | failed
* rollbackinprogress -> rollbackcomplete | rollbackfailed -> running | shutting_down
* rollbackcomplete -> pending | failed -> shutting_down | incomplete
* shutting_down -> stopped
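One possible flattening of those transition chains into a lookup table is shown below. This is an interpretation, since the chained form is ambiguous about which source state maps to which target; the table and the transition_allowed? helper are a sketch, not Conductor code.

```ruby
# One reading of the transition list above; illustrative only.
ALLOWED_TRANSITIONS = {
  new:                [:pending],
  pending:            [:running, :rollbackinprogress, :failed],
  rollbackinprogress: [:rollbackcomplete, :rollbackfailed],
  rollbackcomplete:   [:pending, :failed],
  running:            [:shutting_down, :incomplete],
  incomplete:         [:shutting_down],
  shutting_down:      [:stopped],
}.freeze

# Guard used before saving a state change; unknown states allow nothing.
def transition_allowed?(from, to)
  ALLOWED_TRANSITIONS.fetch(from, []).include?(to)
end
```

A table like this makes illegal updates (for example, a stopped deployment jumping straight to running) fail loudly instead of silently corrupting the deployment's history.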
The deployment state will be used to track the deployment’s history and to decide what to do on a change - for example, when the deployment’s last instance is stopped, a deployment relaunch is done only if the deployment was in the rollbackinprogress state; otherwise the deployment stays stopped.
The state will also be used in the UI for displaying the deployment’s state - currently we use only 3 states (pending, running and failed) and these are computed “per request” by checking the state of all instances in the deployment.
deployment_launch:
in transaction do
create deployment
create deployment’s instances
compute instances dependencies (covered by task 3054)
find match where all instances can be launched (covered by task 3064)
invoke instances_launch (on background)
on error:
the deployment and its instances are not created in Conductor’s DB
the user stays on the deployment launch page
a proper error with the reason why the launch was not successful is displayed
instances_launch:
for each deployment’s instance do
check quota
send dc api launch request
on error:
initiate deployment rollback
instance’s after-update callback:
if the instance is in the failed state, retry up to X times at a reasonable interval; if it’s still not successful, invoke deployment_rollback
deployment_rollback:
if all instances are stopped/failed invoke deployment_relaunch
else send stop request to any instances in pending or running state
deployment_relaunch:
find new match where all instances can be launched (skipping matches which we tried before)
if match is found, invoke instances_launch
else create a log entry about the failed launch in a history log (covered by scenario 3037) and destroy the deployment
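The rollback/relaunch cycle from the pseudocode above can be sketched as follows. DeploymentLauncher, the match hashes and both private helpers are illustrative stand-ins (the real launch sends dc-api requests and stop requests); the key point is that tried matches are remembered so relaunch never repeats one.

```ruby
class DeploymentLauncher
  def initialize(matches)
    @candidates = matches   # ordered candidate matches (hwp/account/realm)
    @tried = []             # matches we already failed on - never reused
  end

  # Returns the match the deployment launched on, or nil when every
  # candidate failed (the caller then logs the failure and destroys
  # the deployment, as in deployment_relaunch above).
  def launch
    while (match = (@candidates - @tried).first)
      @tried << match
      return match if launch_all_instances(match)
      stop_launched_instances   # rollback before trying the next match
    end
    nil
  end

  private

  # Stand-in for the per-instance dc-api launch requests.
  def launch_all_instances(match)
    match[:healthy]
  end

  # Stand-in for sending stop requests to already-launched instances.
  def stop_launched_instances; end
end
```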
Instance params dependencies
Because of instance launch-time params, there can be dependencies
between instances. The good news is that Audrey’s Configserver handles
this itself: all instances can be launched immediately, and the
Configserver will make sure that any services which depend on values
from other instances are started only after those values are available.
More details about instance dependencies will be on a separate page
(scenario https://www.aeolusproject.org/redmine/issues/3035).
In short: Conductor doesn’t care about launch order - it just launches
all instances at once.
Instance launch retries
If an error occurs when launching an instance, it makes sense to retry
launching the instance before doing a rollback - especially for a
deployment with multiple instances where some of them are already
running. This might not be true in all situations because:
* a retry won’t help in many error situations
* each retry means another delay in the launch process; some users may
prefer not to waste time retrying, especially if they have more than
one provider they can use
So this retry option would ideally be configurable by the user, and its
implementation could be split into these tasks:
- extend the deployment model with an attribute to remember the number of remaining retries
- invoke instance relaunch from dbomatic at a constant interval (1 min) for instances whose number of remaining retries is > 0 (and decrease this number with each retry)
- add a “number of retries” input field to the deployment launch page,
with a reasonable default value (2 retries?)
TODO: it may be handy to add a new instance state for instances which failed but will be retried; not sure about this yet.
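The retry bookkeeping itself is small. In this sketch the attribute name remaining_retries, the hash representation and the return symbols are assumptions; the counter would live on the deployment model and be decremented by the periodic worker (dbomatic in the proposal).

```ruby
DEFAULT_RETRIES = 2   # the default suggested above

# Called by the periodic worker when an instance is found in the
# failed state: relaunch while retries remain, otherwise roll back.
def handle_failed_instance(deployment)
  if deployment[:remaining_retries] > 0
    deployment[:remaining_retries] -= 1
    :relaunch   # re-send the launch request for the failed instance
  else
    :rollback   # retries exhausted - start the deployment rollback
  end
end
```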
Rollback
Rollback should be optional too (in other words, it should be possible
to launch an incomplete deployment if a user wants that). TBD -
probably a checkbox on the launch page.
Instance launch timeout
On deployment launch, when an instance stays in the pending state for X
minutes, the launch is terminated and a deployment rollback is
initiated. This timeout should be configurable; the default could be 15
minutes.
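The timeout check is a simple comparison the periodic worker can run against each pending instance. Field names and the hash representation below are illustrative, assuming the instance records when it entered the pending state.

```ruby
LAUNCH_TIMEOUT = 15 * 60   # seconds; the suggested 15-minute default

# True when an instance has been stuck in :pending longer than the
# configured timeout, meaning the launch should be terminated and a
# deployment rollback initiated.
def launch_timed_out?(instance, now)
  instance[:state] == :pending &&
    (now - instance[:pending_since]) > LAUNCH_TIMEOUT
end
```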
Background job
Update: background processing is covered by separate scenario, see
Background Processing
A background job will be used for launching instances. Even just sending
the launch requests for a multi-instance deployment synchronously has
the following disadvantages:
* it may take a long time, and a “connection timeout” error may be displayed
* the user is blocked for this time and sees only a “loading” cursor
* we can’t inform the user about launch progress/what’s being done
There are plenty of background job tools for Ruby/Rails, and choosing a
suitable one is a task of this scenario. I would like to push it a
little bit further: the dbomatic daemon might be replaced by this
background job tool - there is no reason to keep two things which do a
similar job. I briefly searched for the most widely used tools; here are
2 examples:
# Delayed Job:
*** pros: we used this before, it worked fine, it’s packaged in fedora
*** cons: doesn’t support recurring jobs:
https://github.com/collectiveidea/delayed_job/wiki/FEATURE:-Adding-Recurring-Job-Support-to-Delayed_Job
though a workaround is to re-enqueue the job every time it’s executed
# Resque Scheduler:
*** pros: supports both recurring and single jobs, seems to work fine
*** cons: not packaged in Fedora; depends on an external service
(redis)
I’m not saying we have to use one of the above; it’s just inspiration. But I’d prefer a background job tool which supports recurring jobs “natively” so it could replace dbomatic.
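The re-enqueue workaround mentioned for Delayed Job can be illustrated with a toy in-memory queue (Delayed Job itself is not used here; with the real gem the job would be passed to Delayed::Job.enqueue with a :run_at option). The recurring behaviour comes entirely from the job scheduling its own next run at the end of perform.

```ruby
# Toy in-memory queue standing in for a real background-job backend.
class ToyQueue
  def initialize
    @jobs = []
  end

  def enqueue(job, run_at:)
    @jobs << [run_at, job]
  end

  # Run every job whose run_at has passed, giving it the chance to
  # re-enqueue itself - that re-enqueue is what makes it "recurring".
  def work_off(now)
    due, @jobs = @jobs.partition { |run_at, _| run_at <= now }
    due.each { |_, job| job.perform(self, now) }
  end
end

# A dbomatic-style polling job: do the periodic work, then schedule
# the next run.
class PollInstanceStates
  INTERVAL = 60 # seconds between runs

  attr_reader :runs

  def initialize
    @runs = 0
  end

  def perform(queue, now)
    @runs += 1
    # ... real work: poll instance states, trigger retries/rollbacks ...
    queue.enqueue(self, run_at: now + INTERVAL)   # schedule the next run
  end
end
```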
The above is a short/mid-term solution for improving instance launching; it doesn’t add any new dependency/tool. The long-term solution is to integrate Heat (https://github.com/heat-api), which is expected to do everything we need (take care of dependencies between instances, launch instances in the proper order, roll back failed launches, monitoring, …).
We don’t care about dependencies between instances when stopping a deployment.