Rotating Saved Files #146

Open
3 tasks
mavam opened this issue Jan 3, 2024 · 1 comment
Labels
connector Loader and saver improvement An incremental enhancement of an existing feature

Comments

@mavam
Member

mavam commented Jan 3, 2024

When writing large streams of data to files, we need the ability to cut the output at a specific point. Today, users have to rely on third-party tools for this, such as logrotate. There are two obvious ways to rotate:

  • Spatially: after a file reaches a given size
  • Temporally: after a fixed time duration
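A minimal sketch of how both rotation triggers could be combined, assuming an in-memory sink for illustration. The names `RotatingSink`, `max_bytes`, and `max_seconds` are hypothetical and are not part of any existing saver API; closed "files" are simply collected in a list.

```python
import io
import time
from typing import Callable, List, Optional


class RotatingSink:
    """Hypothetical sink that starts a new buffer when a size or age limit is hit."""

    def __init__(
        self,
        max_bytes: Optional[int] = None,
        max_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so that tests can control time
        self.finished: List[bytes] = []  # stands in for closed files
        self._open()

    def _open(self) -> None:
        # Start a fresh buffer and remember when it was opened.
        self._buf = io.BytesIO()
        self._opened_at = self.clock()

    def _should_rotate(self, incoming: int) -> bool:
        size = self._buf.tell()
        if size == 0:
            return False  # never rotate an empty file
        if self.max_bytes is not None and size + incoming > self.max_bytes:
            return True  # spatial: the next chunk would exceed the size limit
        if self.max_seconds is not None:
            if self.clock() - self._opened_at >= self.max_seconds:
                return True  # temporal: the file has been open long enough
        return False

    def write(self, chunk: bytes) -> None:
        if self._should_rotate(len(chunk)):
            self.rotate()
        self._buf.write(chunk)

    def rotate(self) -> None:
        # Close the current buffer and open a new one.
        self.finished.append(self._buf.getvalue())
        self._open()

    def close(self) -> None:
        # Flush the last buffer if it holds any data.
        if self._buf.tell() > 0:
            self.finished.append(self._buf.getvalue())
```

In this sketch, rotation is only checked on `write`, so a temporal limit takes effect when the next chunk arrives rather than exactly at the deadline. A real saver would likely need a timer for idle streams.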

Tasks

@mavam mavam changed the title Rotation of saved files Rotating saved files Jan 3, 2024
@mavam mavam added connector Loader and saver improvement An incremental enhancement of an existing feature labels Jan 3, 2024
@mavam
Member Author

mavam commented Jan 5, 2024

DuckDB does partitioned writes with Hive partitioning, and also supports reading such partitions as follows:

SELECT * FROM read_parquet('orders/*/*/*.parquet', hive_partitioning = 1);

Users most likely don't even have to worry about this:

By default the system tries to infer if the provided files are in a hive partitioned hierarchy. And if so, the hive_partitioning flag is enabled automatically. The autodetection will look at the names of the folders and search for a 'key' = 'value' pattern. This behaviour can be overridden by setting the hive_partitioning flag manually.

Given that Hive partitioning is a quasi-standard across tools in the data community, and that many data tools support it out of the box, we should start with this approach, as it maximizes interoperability and simplicity. For example, Arrow also supports reading partitioned datasets.
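To make the directory layout concrete, here is a small sketch of how a saver could derive Hive-style paths and how a reader could recover the partition keys from them. The helpers `hive_path` and `parse_hive_path` and the file name `data_0.parquet` are hypothetical; the parsing is a simplified illustration of the key=value folder convention, not DuckDB's actual detection logic.

```python
from pathlib import PurePosixPath
from typing import Dict


def hive_path(root: str, partitions: Dict[str, object], filename: str) -> str:
    """Build root/key=value/.../filename, one directory per partition key."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return str(PurePosixPath(root, *parts, filename))


def parse_hive_path(path: str) -> Dict[str, str]:
    """Recover partition keys from directory names of the form key=value."""
    result: Dict[str, str] = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:  # skip directories without a key=value shape
            result[key] = value
    return result


path = hive_path("orders", {"year": 2024, "month": 1}, "data_0.parquet")
keys = parse_hive_path(path)
```

A path built this way has two directory levels below `orders`, so it would be matched by the glob `orders/*/*/*.parquet` from the query above. Note that the recovered values are strings; any type inference is left to the reading tool.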

@dominiklohmann dominiklohmann changed the title Rotating saved files Rotating Saved Files Jan 29, 2024