This repository provisions resources on AWS, preparing them for a deployment of the application on an EKS cluster.

Prerequisites:

- An AWS account, preferably a new isolated one.
- Terraform >= 1.4.6
- A customer contract with Datafold
  - The application does not work without credentials supplied by sales
- Access to our public helm-charts repository

This deployment will create the following resources:
- AWS VPC
- AWS subnet
- AWS S3 bucket for clickhouse backups
- AWS external load balancer
- AWS ACM certificate, unless preregistered and provided
- Three EBS volumes for local data storage
- AWS RDS Postgres database
- An EKS cluster
- Service accounts for the EKS cluster to perform actions outside of its cluster boundary:
  - Provisioning existing EBS volumes
  - Updating load balancer target group to point to specific pods in the cluster
  - Rescaling the nodegroup between 1-2 nodes
- This module will not provision DNS names in your zone.
- See the example for a potential setup, which has dependencies on our helm-charts
Create the bucket and dynamodb table for terraform state file:
- Use the files in `bootstrap` to create a terraform state bucket and a dynamodb lock table.
- Run `./run_bootstrap.sh` to create them. Enter the deployment_name when the question is asked.
  - The `deployment_name` is important. This is used for the k8s namespace and datadog unified logging tags and other places.
  - Suggestion: `company-datafold`
- Transfer the name of that bucket and table into the `backend.hcl` (symlinked into both infra and application)
- Set the `target_account_profile` and `region` where the bucket / table are stored.
- `backend.hcl` is only about where the terraform state file is located.
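
As a rough illustration, a `backend.hcl` for an S3 backend with DynamoDB locking might look like the sketch below. The key names are standard Terraform S3 backend settings; every value is a placeholder (an assumption), and the file shipped in the example directory may be laid out differently.

```hcl
# backend.hcl (sketch; all values are placeholders, not taken from this repository)
bucket         = "company-datafold-terraform-state" # state bucket created by the bootstrap step
dynamodb_table = "company-datafold-terraform-lock"  # lock table created by the bootstrap step
region         = "us-east-1"                        # region where the bucket / table are stored
profile        = "target_account_profile"           # AWS profile for the target account (assumed mapping)

# The state "key" is assumed to be set in each directory's own backend block,
# because this file is shared between infra and application.
```

The file is passed to Terraform with `terraform init -backend-config=../backend.hcl`.
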
The example directory contains a single deployment example, which cleanly separates the
underlying runtime infra from the application deployment into kubernetes. Some specific
elements from the `infra` directory are copied and encrypted into the `application` directory.
Setting up the infrastructure:
- It is easiest if you have full admin access in the target project.
- Pre-create the ACM certificate you want to use on AWS and validate it in your DNS.
- Pre-create a symmetric encryption key that is used to encrypt/decrypt secrets of this deployment.
  - Use the alias instead of the `mrk` link. Put that into `locals.tf`
- Refer to that certificate in main.tf using its domain name (replace "datafold.acme.com").
- Change the settings in locals.tf (the versions in infra and application are sym-linked):
  - provider_region = The region you want to deploy in.
  - aws_profile = The profile you want to use to issue the deployments. Targets the deployment account.
  - kms_profile = Can be the same profile, unless you want the encryption key elsewhere.
  - kms_key = A pre-created symmetric KMS key. Its only purpose is encryption/decryption of deployment secrets.
  - deployment_name = The name of the deployment, used in the kubernetes namespace, container naming and the datadog "deployment" Unified Tag.
- Run `terraform init -backend-config=../backend.hcl` in both application and infra directory.
- Our team will reach out to give you two secrets files:
  - `application_secrets.yaml` goes into the `application` directory.
  - `infra_secrets.yaml` goes into the `infra` directory.
  - Encrypt both files with sops and call both `secrets.yaml`
- Run `terraform apply` in `infra` directory. This should complete ok.
  - Check in the console if you see the load balancer, the EKS cluster, etc.
- Run `terraform apply` in `application` directory.
  - Check the settings made in the `main.tf` file. Maybe you want to set "datadog.install" to `false`.
  - Check with your favourite kubernetes tool if you see the namespace and several datafold pods running there.
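
The `locals.tf` settings listed in the steps above could be filled in roughly as follows. This is a sketch only: the five names come from the list, while every value is a placeholder, and the real file in the example directory may contain further settings.

```hcl
# locals.tf (sketch; values are placeholders, not taken from this repository)
locals {
  provider_region = "us-east-1"              # region to deploy in
  aws_profile     = "target_account_profile" # profile that issues the deployments
  kms_profile     = "target_account_profile" # may differ if the key lives elsewhere
  kms_key         = "alias/company-datafold" # alias of the pre-created symmetric KMS key (assumed format)
  deployment_name = "company-datafold"       # k8s namespace, container naming, datadog tag
}
```

Because the file is sym-linked, one edit applies to both the `infra` and `application` directories.
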
The module by default deploys in two availability zones. This is because, by default, the private and public subnet settings each specify a list of two CIDR ranges.
The AZs in which things get deployed depend on which AZs get selected and in which order. The ordering is alphabetical. In us-east-1 this could be as many as six AZs.
The module sorts the AZs and then iteratively deploys a public / private subnet, specifying its AZ in the module. Thus:
- [10.0.0.0/24] will get deployed in us-east-1a
- [10.0.1.0/24] will get deployed in us-east-1b
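
The ordering described above can be pictured with a small HCL sketch. It is not the module's actual code; the local names and the unsorted AZ list are assumptions chosen to mirror the example.

```hcl
locals {
  # Hypothetical AZ names, in arbitrary order.
  available_azs = ["us-east-1c", "us-east-1a", "us-east-1b"]

  # Default-style subnet list with two CIDR ranges.
  subnet_cidrs = ["10.0.0.0/24", "10.0.1.0/24"]

  # Sort alphabetically, then pair the i-th CIDR with the i-th AZ.
  sorted_azs = sort(local.available_azs)
  subnet_to_az = {
    for i, cidr in local.subnet_cidrs : cidr => local.sorted_azs[i]
  }
}
```

With these inputs, `subnet_to_az` maps 10.0.0.0/24 to us-east-1a and 10.0.1.0/24 to us-east-1b, matching the list above.
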
To deploy to three AZs, override the public/private subnet settings with three CIDR ranges each. The module will then iterate across three elements, and the order of the AZs stays the same by default.
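
A three-AZ override might look like the fragment below. `vpc_public_subnets` and `vpc_private_subnets` are the module inputs documented in this README; the module label and the CIDR values are assumptions, picked to fit inside the default `vpc_cidr` of 10.0.0.0/16.

```hcl
module "datafold" {
  # ... source and other inputs omitted in this sketch ...

  # Three CIDR ranges each, so three AZs are used (values are placeholders).
  vpc_public_subnets  = ["10.0.100.0/24", "10.0.101.0/24", "10.0.102.0/24"]
  vpc_private_subnets = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
}
```
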
You can add an "exclusion list" of AZ IDs. The AZ ID is not the same as the AZ name. AWS shuffles the mapping between AZ names and actual locations across AWS accounts. This means that your us-east-1a might be use1-az1 for you, but it might be use1-az4 for an account elsewhere. So if you need to match AZs, match Availability Zone IDs, not Availability Zone names. The AZ ID is visible in the EC2 console on the "settings" screen, which lists the enabled AZs with their ID and name.
To select particular AZ IDs, exclude the ones you do not want in the az_id_exclude_filter. This is a list. That way, you can restrict the deployment to only the AZs you want. Unfortunately it is an exclude filter and not an include filter. That means that if AWS adds additional AZs, a future AZ could be picked up and cause Terraform to plan replacements.
The good news is that once letters are in use, I'd expect those letters to stay tied to the same AZ ID. Only for new accounts can they be shuffled all over again. So from a terraform state perspective, things should at least be consistent.
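
As a sketch of the exclusion list, assuming `az_id_exclude_filter` is set as an input on the module call (the text names it, but it does not appear in the inputs table of this README, so where it is set is an assumption):

```hcl
module "datafold" {
  # ... source and other inputs omitted in this sketch ...

  # Hypothetical AZ IDs to skip; look up the IDs in your own account first.
  az_id_exclude_filter = ["use1-az3", "use1-az5"]
}
```

The values are AZ IDs, not AZ names, for the reason given above: the name-to-ID mapping differs between accounts.
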
The deployment is created and the initjob should have created the databases and done the initialization of the site settings.
If that didn't complete successfully, try to restart the job.
Once the deployment is complete and the initjob succeeded, we can set install to false for that job in config.yaml:

    initjob:
      install: false

Alternatively, here are the manual steps to achieve the same:
Establish a shell into the `<deployment>-dfshell` container.
It is likely that the scheduler and server containers are crashing in a loop.
All we need to do is run these commands:

    ./manage.py clickhouse create-tables
    ./manage.py database create-or-upgrade
    ./manage.py installation set-new-deployment-params

Now all containers should be up and running.

Requirements:

Name | Version |
---|---|
aws | >= 4.8.0 |
dns | 3.2.1 |

Providers:

Name | Version |
---|---|
aws | >= 4.8.0 |
random | n/a |

Modules:

Name | Source | Version |
---|---|---|
clickhouse_backup | ./modules/clickhouse_backup | n/a |
database | ./modules/database | n/a |
eks | ./modules/eks | n/a |
load_balancer | ./modules/load_balancer | n/a |
networking | ./modules/networking | n/a |
security | ./modules/security | n/a |

Resources:

Name | Type |
---|---|

Inputs:

Name | Description | Type | Default | Required |
---|---|---|---|---|
alb_certificate_domain | Pass a domain name like example.com to this variable in order to enable ALB HTTPS listeners. Terraform will try to find an AWS certificate that is issued and matches the requested domain, so please make sure that you have already issued a certificate for that domain. | string | n/a | yes |
apply_major_upgrade | Sets the flag to allow AWS to apply major upgrade on the maintenance plan schedule. | bool | false | no |
aws_auth_accounts | List of account maps to add to the aws-auth configmap | list(any) | [] | no |
aws_auth_users | List of user maps to add to the aws-auth configmap | list(any) | [] | no |
backend_app_port | The target port to use for the backend services | number | 80 | no |
clickhouse_data_size | EBS volume size for clickhouse data in GB | number | 40 | no |
clickhouse_logs_size | EBS volume size for clickhouse logs in GB | number | 40 | no |
clickhouse_s3_bucket | Bucket where clickhouse backups are stored | string | "clickhouse-backups-abcguo23" | no |
create_aws_auth_configmap | Whether to create the AWS authentication configmap | bool | false | no |
create_rds_kms_key | Set to true to create a separate KMS key (Recommended). | bool | true | no |
create_ssl_cert | Creates an SSL certificate if set. | bool | n/a | yes |
database_name | RDS database name | string | "datafold" | no |
db_instance_tags | The extra tags to be applied to the RDS instance. | map(any) | {} | no |
db_parameter_group_tags | The extra tags to be applied to the parameter group | map(any) | {} | no |
db_subnet_group_tags | The extra tags to be applied to the subnet group | map(any) | {} | no |
default_node_disk_size | Disk size for a node in GB | number | 40 | no |
deploy_vpc_flow_logs | Activates the VPC flow logs if set. | bool | false | no |
deployment_name | Name of the current deployment. | string | n/a | yes |
dhcp_options_domain_name | Specifies DNS name for DHCP options set | string | "" | no |
dhcp_options_domain_name_servers | Specify a list of DNS server addresses for DHCP options set | list(string) | [ | no |
dhcp_options_tags | Tags applied to the DHCP options set. | map(string) | {} | no |
dns_egress_cidrs | List of Internet addresses to which the application has access | list(string) | [] | no |
ebs_extra_tags | The extra tags to be applied to the EBS volumes | map(any) | {} | no |
ebs_iops | IOPS of EBS volume | number | 3000 | no |
ebs_throughput | Throughput of EBS volume | number | 1000 | no |
ebs_type | Type of EBS volume | string | "gp3" | no |
enable_dhcp_options | Flag to use custom DHCP options for DNS resolution. | bool | false | no |
environment | Global environment tag to apply on all datadog logs, metrics, etc. | string | n/a | yes |
host_override | Overrides the default domain name used to send links in invite emails and page links. Useful if the application is behind cloudflare for example. | string | "" | no |
ingress_enable_http_sg | Whether regular HTTP traffic should be allowed to access the load balancer | bool | false | no |
k8s_cluster_version | Ref. https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html | string | "1.29" | no |
k8s_module_version | EKS terraform module version | string | "~> 19.7" | no |
lb_idle_timeout | The time in seconds that the connection is allowed to be idle. | number | 120 | no |
lb_internal | Set to true to make the load balancer internal and not exposed to the internet. | bool | false | no |
manage_aws_auth_configmap | Determines whether to manage the aws-auth configmap | bool | false | no |
managed_node_grp | Ref. https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group | any | n/a | yes |
managed_node_grp_default | Ref. https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt | list(any) | [] | no |
nat_gateway_public_ip | Public IP of the NAT gateway when reusing the NAT gateway instead of recreating | string | "" | no |
private_subnet_tags | The extra tags to be applied to the private subnets | map(any) | {} | no |
propagate_intra_route_tables_vgw | If intra subnets should propagate traffic. | bool | false | no |
propagate_private_route_tables_vgw | If private subnets should propagate traffic. | bool | false | no |
propagate_public_route_tables_vgw | If public subnets should propagate traffic. | bool | false | no |
provider_azs | List of availability zones to consider. If empty, the modules will determine this dynamically. | list(string) | [] | no |
provider_region | The AWS region in which the infrastructure should be deployed | string | n/a | yes |
public_subnet_tags | The extra tags to be applied to the public subnets | map(any) | {} | no |
rds_allocated_storage | The size of RDS allocated storage in GB | number | 20 | no |
rds_backups_replication_retention_period | RDS backup replication retention period | number | 14 | no |
rds_backups_replication_target_region | RDS backup replication target region | string | null | no |
rds_extra_tags | The extra tags to be applied to the RDS instance | map(any) | {} | no |
rds_instance | EC2 instance type for PostgreSQL RDS database. Available instance groups: t3, m4, m5. Available instance classes: medium and higher. | string | "db.t3.medium" | no |
rds_kms_key_alias | RDS KMS key alias. | string | "datafold-rds" | no |
rds_max_allocated_storage | The upper limit the database can grow in GB | number | 100 | no |
rds_param_group_family | The DB parameter group family to use | string | "postgres15" | no |
rds_port | Port the RDS database should be listening on. | number | 5432 | no |
rds_ro_username | RDS read-only user name (not currently used). | string | "datafold_ro" | no |
rds_username | Overrides the default RDS user name that is provisioned. | string | "datafold" | no |
rds_version | Postgres RDS version to use. | string | "15.5" | no |
redis_data_size | Redis EBS volume size in GB | number | 10 | no |
s3_clickhouse_backup_tags | The extra tags to be applied to the S3 clickhouse backup bucket | map(any) | {} | no |
self_managed_node_grp | Ref. https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/self-managed-node-group | any | {} | no |
self_managed_node_grp_default | Configuration for the self managed node group | any | {} | no |
self_managed_node_grp_instance_type | Ref. https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt | string | "THe instance type for the self managed node group." | no |
sg_tags | The extra tags to be applied to the security group | map(any) | {} | no |
tags | Tags to apply to the general module | any | {} | no |
use_default_rds_kms_key | Flag whether or not to use the default RDS KMS encryption key. Not recommended. | bool | false | no |
vpc_cidr | The CIDR of the new VPC, if the vpc_id is not set | string | "10.0.0.0/16" | no |
vpc_id | The VPC ID of an existing VPC to deploy the cluster in. Creates a new VPC if not set. | string | "" | no |
vpc_private_subnets | The private subnet CIDR ranges when a new VPC is created. | list(string) | [ | no |
vpc_propagating_vgws | IDs of virtual private gateways to propagate. | list(any) | [] | no |
vpc_public_subnets | The public network CIDR ranges | list(string) | [ | no |
vpc_tags | The extra tags to be applied to the VPC | map(any) | {} | no |
vpc_vpn_gateway_id | ID of the VPN gateway to attach to the VPC | string | "" | no |
whitelisted_egress_cidrs | List of Internet addresses the application can access going outside | list(string) | n/a | yes |
whitelisted_ingress_cidrs | List of CIDRs that can pass through the load balancer | list(string) | n/a | yes |
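
To make the required inputs concrete, here is a minimal sketch of a module call that sets only the variables marked "yes" in the table above. The module label, the `source` path, and all values are assumptions for illustration; in particular `managed_node_grp` is typed `any`, and its real shape has to be taken from the example directory or the referenced submodule documentation.

```hcl
module "datafold" {
  # Hypothetical source; point this at wherever you consume the module from.
  source = "../.."

  provider_region          = "us-east-1"         # placeholder region
  deployment_name          = "company-datafold"  # naming suggestion from this README
  environment              = "production"        # placeholder datadog environment tag
  alb_certificate_domain   = "datafold.acme.com" # domain of the pre-created ACM certificate
  create_ssl_cert          = false               # certificate is pre-created in this sketch
  whitelisted_ingress_cidrs = ["0.0.0.0/0"]      # placeholder; restrict as needed
  whitelisted_egress_cidrs  = ["0.0.0.0/0"]      # placeholder; restrict as needed

  # Placeholder only: the expected structure is not documented in this README.
  managed_node_grp = {}
}
```
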

Outputs:

Name | Description |
---|---|
clickhouse_access_key | The access key of the IAM user doing the clickhouse backups. |
clickhouse_data_size | The size in GB of the clickhouse EBS data volume |
clickhouse_data_volume_id | The EBS volume ID where clickhouse data will be stored. |
clickhouse_logs_size | The size in GB of the clickhouse EBS logs volume |
clickhouse_logs_volume_id | The EBS volume ID where clickhouse logs will be stored. |
clickhouse_password | The generated clickhouse password to be used in the application deployment |
clickhouse_s3_bucket | The location of the S3 bucket where clickhouse backups are stored |
clickhouse_s3_region | The region where the S3 bucket is created |
clickhouse_secret_key | The secret key of the IAM user doing the clickhouse backups. |
cloud_provider | A string describing the type of cloud provider to be passed onto the helm charts |
cluster_name | The name of the EKS cluster |
cluster_scaler_role_arn | The ARN of the role that is able to scale the EKS cluster nodes. |
db_instance_id | The ID of the RDS database instance |
deployment_name | The name of the deployment |
domain_name | The domain name to be used in DNS configuration |
k8s_load_balancer_controller_role_arn | The ARN of the role provisioned so the k8s cluster can edit the target group through the AWS load balancer controller. |
lb_name | The name of the external load balancer |
load_balancer_ips | The load balancer IP when it was provisioned. |
postgres_database_name | The name of the pre-provisioned database. |
postgres_host | The DNS name for the postgres database |
postgres_password | The generated postgres password to be used by the application |
postgres_port | The port configured for the RDS database |
postgres_username | The postgres username to be used by the application |
redis_data_size | The size in GB of the Redis data volume. |
redis_data_volume_id | The EBS volume ID of the Redis data volume. |
redis_password | The generated redis password to be used in the application deployment |
security_group_id | The security group ID managing ingress from the load balancer |
target_group_arn | The ARN to the target group where the pods need to be registered as targets. |
vpc_cidr | The CIDR of the entire VPC |