## Automated EC2 Lifecycle
Every EC2 instance goes through the following stages:

- Pending (`cfn-init`):
  - Install Airflow + Celery
  - Load secrets from the AWS SSM Parameter Store
  - Depending on the instance, set up the appropriate service:
    - For scheduler instances, enable `airflow-scheduler`
    - For webserver instances, enable `airflow-webserver`
    - For celery worker instances, enable `airflow-workerset`
  - Enable `airflow-confapply`, a sidecar that restarts Airflow services if:
    - Something changes in the Airflow environment variables.
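The per-role service setup above is the kind of thing `cfn-init` reads from a resource's metadata. A minimal sketch for a scheduler instance, assuming an illustrative resource name (`SchedulerLaunchConfig`) and install command, neither of which is taken from the actual stack:

```yaml
# Illustrative Launch Configuration metadata; resource name, install command
# and service manager key are assumptions, not the stack's real definitions.
SchedulerLaunchConfig:
  Type: AWS::AutoScaling::LaunchConfiguration
  Metadata:
    AWS::CloudFormation::Init:
      config:
        commands:
          01_install_airflow:
            command: pip install "apache-airflow[celery]"
        services:
          sysvinit:
            airflow-scheduler:
              enabled: true        # start on boot
              ensureRunning: true  # cfn-init restarts it if stopped
            airflow-confapply:
              enabled: true
              ensureRunning: true
```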
## CloudFormation Integration
Some template parameters map directly to the Airflow configuration. Stack updates that change these parameters modify the Launch Configuration metadata of each Airflow service's Auto Scaling Group, so newer instances are launched with the updated parameters. Still, these changes must also be propagated to the services already running on older instances.
This is achieved by leveraging the CloudFormation Helper Scripts suite. Every instance runs a `cfn-hup` service that watches for metadata changes in the CloudFormation template and re-triggers the setup process on already running EC2 instances. This overrides the old configuration and, thanks to the `airflow-confapply` service, restarts the Airflow processes to pick up the new parameters with minimal impact.
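For reference, `cfn-hup` is driven by two small config files; a sketch, assuming illustrative stack and resource names (`my-airflow-stack`, `SchedulerLaunchConfig`) that may differ from the real ones:

```ini
; /etc/cfn/cfn-hup.conf -- stack name and region are placeholders
[main]
stack=my-airflow-stack
region=us-east-1
interval=1

; /etc/cfn/hooks.d/cfn-auto-reloader.conf
; Re-run cfn-init whenever the watched metadata path changes after an update.
[cfn-auto-reloader-hook]
triggers=post.update
path=Resources.SchedulerLaunchConfig.Metadata.AWS::CloudFormation::Init
action=/opt/aws/bin/cfn-init -v --stack my-airflow-stack --resource SchedulerLaunchConfig --region us-east-1
```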
## Systems Manager Integration
When operating a distributed system like Airflow, it's frequently useful to manage or inspect some or all of its moving parts. On EC2, this would usually mean opening public SSH ports or provisioning bastion hosts to securely reach instances in private subnets. Even then, the attack surface grows with every shared secret key to manage, while the setup offers very little tooling to help operators automate maintenance tasks.
This stack uses the latest Amazon Linux AMIs, which ship with the SSM Agent, so you can leverage the full capabilities of AWS Systems Manager to execute remote commands or scripts against a collection of EC2 instances at once. You can also use Session Manager, available from the AWS Console, for quick inspections and routine operational tasks that require CLI access to individual instances; it works on top of your existing IAM policies. SSM also adds auditing capabilities by logging past operations and managed sessions.
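As a sketch of what this enables (the tag values and instance ID below are placeholders, and both commands require AWS credentials with the appropriate SSM permissions):

```shell
# Run a shell command on every instance with a given tag, via Run Command.
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Name,Values=airflow-worker" \
  --parameters 'commands=["systemctl status airflow-worker"]'

# Open an audited interactive shell on one instance, no SSH key needed.
aws ssm start-session --target i-0123456789abcdef0
```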
## CodeDeploy Integration
Deploying Airflow on distributed persistent workers can be tricky. A frequently used approach is to use a shared network directory to keep all instances in sync with configuration and DAG files, because having a single source of truth makes deployment much easier (e.g. using git-sync). One thing to keep in mind is that updating files in the middle of the execution of a task might have unintended consequences, like the first few bash operators running scripts from the old revision and the last few bash operators running incompatible scripts from the new revision. Safely deploying new code requires the workers to stop, which means a single shared directory requires all the workers to stop simultaneously, something hard to orchestrate.
Thanks to AWS CodeDeploy, distributing the Airflow configs and DAG files to all individual instances is completely automated and centralized. Each instance is equipped with the `codedeploy-agent`, which polls for pending deployments and takes care of the installation process. Just generate a new deployment package using the CLI or other tools like CodePipeline, and CodeDeploy will take care of the rest through its agents. This process is easy to adopt, allows fast release cycles, and is flexible enough to handle complex upgrade scenarios, like restarting Airflow services or installing additional packages.
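The installation steps live in an `appspec.yml` at the root of the deployment package; a minimal sketch, where the file destinations and hook scripts are assumptions rather than this stack's actual layout:

```yaml
# appspec.yml -- paths and hook scripts below are illustrative.
version: 0.0
os: linux
files:
  - source: dags
    destination: /opt/airflow/dags
  - source: config/airflow.cfg
    destination: /opt/airflow
hooks:
  ApplicationStop:
    - location: scripts/stop_workers.sh    # e.g. drain running tasks first
      timeout: 300
  AfterInstall:
    - location: scripts/restart_airflow.sh
      timeout: 120
```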
Nevertheless, as with any asynchronous message-based system, it's important to make sure deployments are backwards compatible in terms of the messages exchanged between the scheduler and the worker instances. More information on how to safely deploy DAGs and configuration changes can be found in a dedicated document.
... under construction...
... under construction...
## High Availability

High availability is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
... under construction...
## Fault Tolerance

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.
... under construction...