-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support s3_data_dir and s3_data_naming #50
base: master
Are you sure you want to change the base?
Conversation
b1d1b24
to
44d4ab6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some manual testing- looks good to me 👍 Looking forward to use it!
What are the contribution guidelines here? Can we merge to main?
Worth noting that if using a static external location (ie, Either reverting #43 or merging #49 should fix this. |
44d4ab6
to
89d1830
Compare
Rebased on 1.0.1 |
That looks exactly what I was looking for. We want to move to a similar S3 structure, where we have a query bucket (s3_staging_dir) and a data bucket. With a lot of dbt deploys, the S3 bucket grows a lot because of the new UUID folders which we want to clean-up and set lifecycle policy rules on both the query results and parquet data. |
@Tomme is there anything preventing this from being merged? It does seem like a very useful addition 🙂 |
Not sure if it helps if I comment to gain visibility on this approval. But my team is also waiting on this. If there's anything I can do to help I'd love to help. Edit I did run it ( |
I'd also be interested in having this PR merged. |
This PR is currently being review at dbt-labs/dbt-athena#4 |
This PR and its refactor is being merged in dbt-labs/dbt-athena#39 and available in dbt-athena-community==1.3.2 |
Problem
Currently, the only options for determining where data ends up in S3 are to set
s3_staging_bucket
in the connection properties or setexternal_location
on each model.The
s3_staging_bucket
argument is ignored if the Athena workgroup already has a staging bucket configured (but is used bydbt seed
, which always sets a location).We have a (not uncommon?) layout of one S3 bucket for the results of all Athena queries from which objects are rapidly expired, and another with appropriate lifecycle configuration for storing created tables we want to keep.
Aside from setting
external_location
on every model (which isn't so easily composable across different profiles), there isn't a nice way to control S3 layout indbt-athena
as it stands.Implementation
This adds two new (optional) connection options:
s3_data_dir
: if set, the default root directory to create tables and seeds in (eg,s3://my-data-bucket/dbt/
)s3_data_naming
: the strategy for naming subdirectories ofs3_data_dir
in which we'll actually store tables; two implemented options:uuid
: tables or seeds are saved to{s3_data_dir}/{uuid4}/
(this was how seed tables were already named)schema_table
: name according to the schema (ie, Athena database) and table like{s3_data_dir}/{schema}/{table}/
If
s3_data_dir
is unset, the behaviour should be unchanged:{s3_staging_dir}/tables/{uuid4}/
external_location
if set, and Athena's default otherwise.