-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reading multi-source compressed JSONL files #17161
base: branch-24.12
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the approach!
Some suggestions, main one is the "memoization" in the new source type.
Status update: Implementing benchmark for compressed JSONL files in #17219 |
Plot of throughput of Benchmark raw output: <style> </style>
Without the custom data source for compressed inputs, the parser errors out for all of the above data sizes. I suspect that the heuristic for compressed inputs in the previous implementation falls short without the retry policy, but some more investigation is required to understand the failure, especially for the smaller data sizes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't reviewed the new decompress_snappy
as I'm not sure its needed. Otherwise looks good.
Description
Fixes #17068
Fixes #12299
This PR introduces a new datasource for compressed inputs which enables batching and byte range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover. instead of using a 4:1 compression ratio heuristic, the device buffer size is estimated accurately for GZIP, ZIP, and SNAPPY compression types. For remaining types, the files are first decompressed then batched.
TODO: Reuse existing JSON tests but with an additional compression parameter to verify correctness.Handled by #17219, which implements compressed JSON writer required for the above test.
Checklist