duck-query-lambda
is a custom AWS Lambda Layer that provides a runtime for executing DuckDB queries programmatically.
Since the S3 Select feature has ended, this function provides an alternative for querying data stored in S3 (any other locations) using DuckDB.
This layer allows you to run DuckDB queries without writing any code or deploying anything other than a simple function based on this runtime. This makes it ideal for use in things like AWS Step Functions for tasks such as querying, data transformation, and more.
- 🧪 Experimental
- 📱 Arm64 only
- Custom Runtime: Provides a custom runtime for AWS Lambda to execute DuckDB queries.
- Data Transformation: Perform complex data transformations using SQL queries.
- Integration with Step Functions: Easily integrate with AWS Step Functions for orchestrating data workflows.
The Lambda Layer for this DuckDB runtime is available in the AWS Serverless Application Repository. You can deploy it directly from the AWS Management Console or using the AWS CLI.
- Install from the AWS Console: https://serverlessrepo.aws.amazon.com/applications/eu-west-1/949339270388/duck-query-lambda
- Install using AWS SAM or CloudFormation:
duckquerylambda:
Type: AWS::Serverless::Application
Properties:
Location:
ApplicationId: arn:aws:serverlessrepo:eu-west-1:949339270388:applications/duck-query-lambda
SemanticVersion: 0.0.8 # x-release-please-version
- Install using the AWS CDK:
import * as sam from "aws-cdk-lib/aws-sam";
...
new sam.CfnApplication(this, "DuckQueryRuntimeLayer", {
location: {
applicationId: "arn:aws:serverlessrepo:eu-west-1:949339270388:applications/duck-query-lambda",
semanticVersion: "0.0.8", // x-release-please-version
},
});
An example SAM project can be found in the examples/
directory.
You don't need to write any code to use the DuckDB runtime. You can create a Lambda function that uses the runtime, give it some IAM permissions and then invoke it with a query.
- AWS CLI
- AWS SAM CLI
- Docker (for building the Lambda Layer)
-
Clone the repository:
git clone https://github.com/fourTheorem/duck-query-lambda.git cd duck-query-lambda
-
Build the Lambda Layer:
sam build
-
Package the Lambda Layer:
sam package --output-template-file packaged.yaml --s3-bucket <your-s3-bucket>
-
Deploy the Lambda Layer:
sam deploy --template-file packaged.yaml --stack-name duck-query-lambda --capabilities CAPABILITY_IAM
-
Add the Lambda Layer to your Lambda function (AWS SAM example):
Resources: MyLambdaFunction: Type: AWS::Serverless::Function Properties: Handler: bootstrap Runtime: provided Architecture: arm64 Layers: - arn:aws:lambda:<region>:<account-id>:layer:duck-query-lambda:<version>
-
Invoke the Lambda function with a query:
{ "query": "SELECT * FROM 's3://bucket/table.parquet'" }
Here is an example of how to use the Lambda Layer in an AWS Step Function:
{
"StartAt": "Convert Parquet to JSON",
"States": {
"RunQuery": {
"Type": "Task",
"Resource": "arn:aws:lambda:<region>:<account-id>:function:MyLambdaFunction",
"Parameters": {
"query": "COPY (SELECT * FROM 's3://bucket/table.parquet) TO 's3://bucket/output/result.json' (ARRAY)'",
},
"End": true
}
}
}
By default, the Lambda function will not return the query results. This is because it's not trivial to convert all results back to JSON in a way that every user expects. If you do want to get the results back synchronously, you can write them to a temporary file in the Lambda and the function will then return the contents of that file, base64 encoded.
Here is an example of how to do this:
{
"query": "COPY (SELECT * FROM 'https://github.com/Teradata/kylo/raw/refs/heads/master/samples/sample-data/parquet/userdata1.parquet' LIMIT 10) TO '/tmp/output.json'",
"outputFile": "/tmp/output.json"
}