feat(python): IO plugins #17939
Conversation
Codecov Report
Attention: patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #17939 +/- ##
==========================================
- Coverage 80.35% 80.32% -0.04%
==========================================
Files 1492 1494 +2
Lines 196330 196453 +123
Branches 2813 2817 +4
==========================================
+ Hits 157764 157802 +38
- Misses 38045 38131 +86
+ Partials 521 520 -1
☔ View full report in Codecov by Sentry.
Alright, major bump it is. 👍
Will this eventually replace anonymous_scan, with streaming support?
Would this be the approach for a native Delta table reader as well? An IO plugin?
This sounds very interesting. I have a project called gribtoarrow which uses C++ and pybind11 to read a GRIB file (a World Meteorological Organization format, basically the same as a CDF file, if anything more of a standard than CDF). Internally it uses an iterator, and I currently use it with Polars in a loop. Do you have an example of using an IO plugin?
Yes, this would work for that. I am looking for a good example source. It might be native Delta, for Rust.
Why? :/
On the version: long story, and not in my control. For a source, have you considered GDAL (https://gdal.org/index.html)? It enables a huge number of file formats to be read, including most of the scientific formats such as HDF5, GRIB and Zarr. In fact, GDAL is used under the hood in the Mosaic library to allow Spark to read such formats.
This sets up all the architecture for IO plugins. Different from expression plugins, these won't go over FFI directly, but will use Python as an intermediary. This is fine, as the IO can release the GIL and do plenty of work before it needs it again.
The IO sources will be consumed by the batch engine and the streaming engine. The sources can act as a Python generator yielding new dataframes. They will accept:
- columns: str | None to apply projections,
- predicate: Expr | None to apply predicates, and
- slice: tuple[int, int] | None to apply slices/early stopping.
After this is in, I will follow up with a native HDF5 reader.
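To make the generator contract above concrete, here is a minimal, Polars-free sketch of a source that honors the three pushdown parameters. Everything in it is illustrative: `my_source` is a hypothetical name, plain dicts stand in for the DataFrame batches a real plugin would yield, `columns` is taken as a list for the example (the description above types it `str | None`), and the predicate is modeled as a plain Python callable rather than an `Expr`.

```python
# Illustrative sketch only: dicts stand in for DataFrame batches,
# and the parameter handling mirrors the pushdown contract described above.
from typing import Callable, Iterator, Optional, Tuple

Row = dict

def my_source(  # hypothetical source; a real plugin wraps a file reader
    rows: list[Row],
    columns: Optional[list[str]] = None,
    predicate: Optional[Callable[[Row], bool]] = None,
    slice_: Optional[Tuple[int, int]] = None,
    batch_size: int = 2,
) -> Iterator[list[Row]]:
    """Yield batches, applying slice, predicate, and projection pushdown."""
    if slice_ is not None:
        offset, length = slice_
        rows = rows[offset : offset + length]  # slice / early stopping
    if predicate is not None:
        rows = [r for r in rows if predicate(r)]  # predicate pushdown
    if columns is not None:
        rows = [{c: r[c] for c in columns} for r in rows]  # projection pushdown
    for i in range(0, len(rows), batch_size):
        yield rows[i : i + batch_size]

# Usage: project to "a" and keep only even values of "a".
data = [{"a": i, "b": i * 2} for i in range(6)]
batches = list(my_source(data, columns=["a"], predicate=lambda r: r["a"] % 2 == 0))
# batches == [[{"a": 0}, {"a": 2}], [{"a": 4}]]
```

Because the engine hands the source these parameters instead of filtering afterwards, the reader can skip decoding columns and rows it will never need, which is where the efficiency of this design comes from.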
The goal is to support many more file formats as separate wheels. This way Polars can support many IO formats efficiently without worrying about binary size.