This utility replicates the AWS Glue Data Catalog from one AWS account to another. Using it, you can replicate databases, tables, and partitions from one source AWS account to one or more target AWS accounts. It uses the AWS Glue API via the AWS SDK for Java, along with serverless technologies such as AWS Lambda, Amazon SQS, and Amazon SNS. The architecture of this utility is shown in the following diagram.
You can deploy this utility through CloudFormation in your AWS accounts by following the instructions in this README.md, or follow the guide below for a manual deployment.
- The source code is a Maven project. You can build it using standard Maven commands, e.g. `mvn -X clean install`, or use the options available in your IDE.
- The above step generates a JAR file, e.g. aws-glue-data-catalog-replication-utility-1.0.0.jar.
This utility requires the following AWS resources.

In the source account:
- 3 AWS Lambda functions
- 3 Amazon DynamoDB tables
- 2 Amazon SNS topics
- 1 Amazon SQS queue
- 1 Amazon S3 bucket

In each target account:
- 3 AWS Lambda functions
- 2 Amazon DynamoDB tables
- 2 Amazon SQS queues
Class | Purpose |
---|---|
GDCReplicationPlannerLambda | Lambda function that determines the list of databases to export. It is the driver program that initiates the replication process. |
ExportLambda | Lambda function to export databases and tables. |
ExportLargeTableLambda | Lambda function to export large tables (tables with more than 10 partitions). |
ImportLambda | Lambda function to import databases and tables. |
ImportLargeTableLambda | Lambda function to import large tables. |
DLQProcessorLambda | Lambda function used to process errors generated by ImportLambda. |
The following resources are created in the source account.

Create the DynamoDB tables defined in the following table (a CLI sketch follows):

Table | Purpose | Schema | Capacity |
---|---|---|---|
glue_database_export_task | Audit data for the replication planner | Partition key: db_id (String), Sort key: export_run_id (Number) | On-Demand |
db_status | Audit data for exported databases | Partition key: db_id (String), Sort key: export_run_id (Number) | On-Demand |
table_status | Audit data for exported tables | Partition key: table_id (String), Sort key: export_run_id (Number) | On-Demand |
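As an illustration, the tables can be created with the AWS CLI. The sketch below creates glue_database_export_task with on-demand capacity; the other two tables follow the same pattern with their respective keys:

```
# Sketch: create the replication planner's audit table with on-demand capacity.
aws dynamodb create-table \
  --table-name glue_database_export_task \
  --attribute-definitions AttributeName=db_id,AttributeType=S AttributeName=export_run_id,AttributeType=N \
  --key-schema AttributeName=db_id,KeyType=HASH AttributeName=export_run_id,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST
```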
Create two SNS topics (a CLI sketch follows):
- Topic 1: Name = e.g. ReplicationPlannerSNSTopic
- Topic 2: Name = e.g. SchemaDistributionSNSTopic
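The topics can be created with a short CLI sketch, using the example names above:

```
aws sns create-topic --name ReplicationPlannerSNSTopic
aws sns create-topic --name SchemaDistributionSNSTopic
```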
Create an S3 bucket. It is used to save partitions for large tables (tables with more than 10 partitions). This bucket must grant cross-account permissions to the IAM role used by the ImportLargeTableLambda function in each target account; a bucket-policy sketch is shown below.
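A minimal bucket-policy sketch, assuming hypothetical names: the bucket gdc-replication-large-tables and a target-account execution role named ImportLambdaExecutionRole. Replace these, and TargetAccount, with your own values:

```
# Sketch: allow a target-account Lambda execution role to read exported partition files.
# Bucket name, role name, and TargetAccount are placeholders.
aws s3api put-bucket-policy --bucket gdc-replication-large-tables --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::TargetAccount:role/ImportLambdaExecutionRole"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::gdc-replication-large-tables/*"
  }]
}'
```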
Create one SQS queue (a CLI sketch follows this list):
- Queue Name = e.g. LargeTableSQSQueue
- Queue Type = Standard
- Default Visibility Timeout = e.g. 3 minutes 15 seconds. Note: this must be higher than the execution timeout of the ExportLargeTableLambda function.
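A CLI sketch for the queue; 3 minutes 15 seconds equals a VisibilityTimeout of 195 seconds:

```
aws sqs create-queue --queue-name LargeTableSQSQueue \
  --attributes VisibilityTimeout=195
```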
Create a Lambda execution IAM role and attach it to the Lambda functions deployed in the source account. This role needs multiple permissions; refer to the following IAM policies for the required permissions:
- You can use the AWS managed policy AWSLambdaExecute (Policy ARN: arn:aws:iam::aws:policy/AWSLambdaExecute)
- sample_sqs_policy_source_and_target_accounts
- sample_sns_policy_source_account
- sample_glue_policy_source_account
- sample_ddb_policy_source_and_target_accounts
Deploy the GDCReplicationPlannerLambda function (a CLI sketch follows the table):
- Runtime = Java 8
- Function package = the JAR file generated in the Build Instructions section
- Lambda Handler = com.amazonaws.gdcreplication.lambda.GDCReplicationPlanner
- Timeout = e.g. 5 minutes
- Memory = e.g. 128 MB
- Environment variables as defined in the following table:

Variable Name | Variable Value |
---|---|
source_glue_catalog_id | Source AWS account ID |
ddb_name_gdc_replication_planner | Name of the glue_database_export_task DynamoDB table in the source account |
database_prefix_list | List of database prefixes separated by a token, e.g. raw_data_,processed_data_. To export all databases, do not add this variable. |
separator | The separator used in database_prefix_list, e.g. ','. This can be skipped when database_prefix_list is not added. |
region | e.g. us-east-1 |
sns_topic_arn_gdc_replication_planner | SNS topic ARN for ReplicationPlannerSNSTopic |
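As an illustration, the function can be deployed from the CLI as sketched below. The role name GDCReplicationLambdaRole and the SourceAccount account ID are placeholders; the other Lambda functions in this guide follow the same pattern with their own handlers and variables:

```
# Sketch: deploy the replication planner. Role name and SourceAccount are placeholders.
aws lambda create-function \
  --function-name GDCReplicationPlannerLambda \
  --runtime java8 \
  --handler com.amazonaws.gdcreplication.lambda.GDCReplicationPlanner \
  --zip-file fileb://aws-glue-data-catalog-replication-utility-1.0.0.jar \
  --role arn:aws:iam::SourceAccount:role/GDCReplicationLambdaRole \
  --timeout 300 --memory-size 128 \
  --environment "Variables={source_glue_catalog_id=SourceAccount,ddb_name_gdc_replication_planner=glue_database_export_task,region=us-east-1,sns_topic_arn_gdc_replication_planner=arn:aws:sns:us-east-1:SourceAccount:ReplicationPlannerSNSTopic}"
```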
Deploy the ExportLambda function:
- Runtime = Java 8
- Function package = the JAR file generated in the Build Instructions section
- Lambda Handler = com.amazonaws.gdcreplication.lambda.ExportDatabaseWithTables
- Timeout = e.g. 5 minutes
- Memory = e.g. 192 MB
- Environment variables as defined in the following table:

Variable Name | Variable Value |
---|---|
source_glue_catalog_id | Source AWS account ID |
ddb_name_db_export_status | Name of the db_status DynamoDB table in the source account |
ddb_name_table_export_status | Name of the table_status DynamoDB table in the source account |
region | e.g. us-east-1 |
sns_topic_arn_export_dbs_tables | SNS topic ARN for SchemaDistributionSNSTopic |
sqs_queue_url_large_tables | SQS queue URL for LargeTableSQSQueue |
Add ReplicationPlannerSNSTopic as a trigger to the ExportLambda function.
Deploy the ExportLargeTableLambda function:
- Runtime = Java 8
- Function package = the JAR file generated in the Build Instructions section
- Lambda Handler = com.amazonaws.gdcreplication.lambda.ExportLargeTable
- Timeout = e.g. 3 minutes
- Memory = e.g. 256 MB
- Environment variables as defined in the following table:

Variable Name | Variable Value |
---|---|
s3_bucket_name | Name of the S3 bucket used to save partitions for large tables |
ddb_name_table_export_status | Name of the table_status DynamoDB table in the source account |
region | e.g. us-east-1 |
sns_topic_arn_export_dbs_tables | SNS topic ARN for SchemaDistributionSNSTopic |
Add LargeTableSQSQueue as a trigger to the ExportLargeTableLambda function (a CLI sketch follows):
- Batch size = 1
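The trigger can also be added with the CLI, for example:

```
# Sketch: wire LargeTableSQSQueue to ExportLargeTableLambda with a batch size of 1.
aws lambda create-event-source-mapping \
  --function-name ExportLargeTableLambda \
  --event-source-arn arn:aws:sqs:us-east-1:SourceAccount:LargeTableSQSQueue \
  --batch-size 1
```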
Cross-account permissions in the source account: grant the target account permission to subscribe to the second SNS topic:
```
aws sns add-permission --label lambda-access --aws-account-id TargetAccount \
  --topic-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
  --action-name Subscribe ListSubscriptionsByTopic Receive
```
The remaining resources are created in each target account.

Create the DynamoDB tables defined in the following table:

Table | Purpose | Schema | Capacity |
---|---|---|---|
db_status | Audit data for imported databases | Partition key: db_id (String), Sort key: import_run_id (Number) | On-Demand |
table_status | Audit data for imported tables | Partition key: table_id (String), Sort key: import_run_id (Number) | On-Demand |
Create an SQS queue:
- Queue Name = LargeTableSQSQueue
- Queue Type = Standard
- Default Visibility Timeout = e.g. 3 minutes 15 seconds. Note: this must be higher than the execution timeout of the ImportLargeTableLambda function.
Create an SQS queue for dead-letter processing:
- Queue Name = DeadLetterQueue
- Queue Type = Standard
- Default Visibility Timeout = e.g. 3 minutes 15 seconds
Create a Lambda execution IAM role and attach it to the Lambda functions deployed in the target account. This role needs multiple permissions; refer to the following IAM policies for the required permissions:
- You can use the AWS managed policy AWSLambdaExecute (Policy ARN: arn:aws:iam::aws:policy/AWSLambdaExecute)
- sample_sqs_policy_source_and_target_accounts
- sample_glue_policy_target_account
- sample_ddb_policy_source_and_target_accounts
Deploy the ImportLambda function (a CLI sketch for its environment variables follows the table):
- Runtime = Java 8
- Function package = the JAR file generated in the Build Instructions section
- Lambda Handler = com.amazonaws.gdcreplication.lambda.ImportDatabaseOrTable
- Timeout = e.g. 5 minutes
- Memory = e.g. 192 MB
- Environment variables as defined in the following table:

Variable Name | Variable Value |
---|---|
target_glue_catalog_id | Target AWS account ID |
ddb_name_db_import_status | Name of the db_status DynamoDB table in the target account |
ddb_name_table_import_status | Name of the table_status DynamoDB table in the target account |
skip_archive | true |
region | e.g. us-east-1 |
sqs_queue_url_large_tables | SQS queue URL for LargeTableSQSQueue |
dlq_url_sqs | SQS queue URL for DeadLetterQueue |
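If you set the variables from the CLI rather than the console, a sketch (the account ID and queue URLs are placeholders):

```
# Sketch: set ImportLambda environment variables. Replace TargetAccount and the queue URLs.
aws lambda update-function-configuration --function-name ImportLambda \
  --environment '{"Variables":{"target_glue_catalog_id":"TargetAccount","ddb_name_db_import_status":"db_status","ddb_name_table_import_status":"table_status","skip_archive":"true","region":"us-east-1","sqs_queue_url_large_tables":"https://sqs.us-east-1.amazonaws.com/TargetAccount/LargeTableSQSQueue","dlq_url_sqs":"https://sqs.us-east-1.amazonaws.com/TargetAccount/DeadLetterQueue"}}'
```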
Give SchemaDistributionSNSTopic permission to invoke the ImportLambda function:
```
aws lambda add-permission --function-name ImportLambda \
  --source-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
  --statement-id sns-x-account --action "lambda:InvokeFunction" \
  --principal sns.amazonaws.com
```
Subscribe the ImportLambda function to SchemaDistributionSNSTopic:
```
aws sns subscribe --protocol lambda \
  --topic-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
  --notification-endpoint arn:aws:lambda:us-east-1:TargetAccount:function:ImportLambda
```
Deploy the ImportLargeTableLambda function:
- Runtime = Java 8
- Function package = the JAR file generated in the Build Instructions section
- Lambda Handler = com.amazonaws.gdcreplication.lambda.ImportLargeTable
- Timeout = e.g. 3 minutes
- Memory = e.g. 256 MB
- Environment variables as defined in the following table:

Variable Name | Variable Value |
---|---|
target_glue_catalog_id | Target AWS account ID |
ddb_name_table_import_status | Name of the table_status DynamoDB table in the target account |
skip_archive | true |
region | e.g. us-east-1 |
Add LargeTableSQSQueue as a trigger to the ImportLargeTableLambda function:
- Batch size = 1
Deploy the DLQProcessorLambda function:
- Runtime = Java 8
- Function package = the JAR file generated in the Build Instructions section
- Lambda Handler = com.amazonaws.gdcreplication.lambda.DLQImportDatabaseOrTable
- Timeout = e.g. 3 minutes
- Memory = e.g. 192 MB
- Environment variables as defined in the following table:

Variable Name | Variable Value |
---|---|
target_glue_catalog_id | Target AWS account ID |
ddb_name_db_import_status | Name of the db_status DynamoDB table in the target account |
ddb_name_table_import_status | Name of the table_status DynamoDB table in the target account |
skip_archive | true |
dlq_url_sqs | SQS queue URL for DeadLetterQueue |
region | e.g. us-east-1 |
Add DeadLetterQueue as a trigger to the DLQProcessorLambda function:
- Batch size = 1
This solution was designed around three main tenets: simplicity, scalability, and cost-effectiveness. The following are direct benefits:
- Target AWS accounts are independent, allowing the solution to scale efficiently.
- The target accounts always see the latest table information.
- Lightweight and dependable at scale.
- The implementation is fully customizable.
The following are the primary limitations:
- This utility is NOT intended for real-time replication. Refer to the section Use Case 2 - Ongoing replication to learn how to run the replication process as a scheduled job.
- This utility is NOT intended for two-way replication between AWS Accounts.
- This utility does NOT attempt to resolve database and table name conflicts which may result in undesirable behavior.
Use Case 1 - One-time replication: to do this, run the GDCReplicationPlannerLambda function using a test event in the AWS Lambda console.
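The same run can be started from the CLI, assuming an empty test event suffices (the planner is driven by its environment variables rather than the event body):

```
# Sketch: invoke the planner once. AWS CLI v2 may also need --cli-binary-format raw-in-base64-out.
aws lambda invoke --function-name GDCReplicationPlannerLambda \
  --payload '{}' response.json
```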
Use Case 2 - Ongoing replication: to do this, create a CloudWatch Events rule in the source account and add GDCReplicationPlannerLambda as its target. Refer to the AWS documentation on CloudWatch Events rules for more details.
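A minimal scheduling sketch; the rule name GDCReplicationSchedule and the daily rate are examples:

```
# Sketch: run the planner once a day. SourceAccount is a placeholder.
aws events put-rule --name GDCReplicationSchedule --schedule-expression "rate(1 day)"
aws lambda add-permission --function-name GDCReplicationPlannerLambda \
  --statement-id cwe-invoke --action "lambda:InvokeFunction" \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:SourceAccount:rule/GDCReplicationSchedule
aws events put-targets --rule GDCReplicationSchedule \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:SourceAccount:function:GDCReplicationPlannerLambda"
```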
For databases and tables, the actions taken by the import Lambda functions depend on the state of the Glue Data Catalog in the target account. Those actions are summarized in the following table.
Input Message Type | State of Target Glue Data Catalog | Action Taken in Target Glue Data Catalog |
---|---|---|
Database | Database already exists | Skip the message |
Database | Database does not exist | Create Database |
Table | Table already exists | Update Table |
Table | Table does not exist | Create Table |
For partitions, the actions are summarized in the following table:
Partitions in Export | State in Target Glue Data Catalog | Action Taken in Target Account |
---|---|---|
Partitions DO NOT exist | Target Table has no partitions | No action taken |
Partitions DO NOT exist | Target Table has partitions | Delete current partitions |
Partitions exist | Target Table has no partitions | Create new partitions |
Partitions exist | Target Table has partitions | Delete current partitions, create new partitions |
This sample code is made available under the MIT-0 license. See the LICENSE file.