
Support for Arrow PyCapsule Interface #2504

Open
kylebarron opened this issue Jul 11, 2024 · 3 comments
@kylebarron commented Jul 11, 2024

Is your feature request related to a problem? Please describe.

The Arrow PyCapsule Interface is a new spec to simplify Arrow interop between compiled Python libraries.
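
For context, the spec itself is just a small set of dunder methods that return PyCapsules. A minimal sketch of the streaming hook, with the method and capsule names taken from the spec and the body elided:

```python
class MyArrowExporter:
    def __arrow_c_stream__(self, requested_schema=None):
        # Per the spec: return a PyCapsule named "arrow_array_stream"
        # wrapping an ArrowArrayStream struct. `requested_schema`, if
        # provided, is an "arrow_schema" capsule describing the schema
        # the caller would prefer.
        ...
```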

For example, you have a daft.from_arrow method, but it narrowly allows only pyarrow Table objects:

Daft/daft/table/table.py, lines 102 to 104 at a5badb4:

```python
@staticmethod
def from_arrow(arrow_table: pa.Table) -> Table:
    assert isinstance(arrow_table, pa.Table)
```

This would fail on a pyarrow.RecordBatchReader or any non-pyarrow Arrow object, and it requires a pyarrow dependency, which is quite large.

A from_arrow method that looks for the __arrow_c_stream__ method would work out of the box on any Arrow-based Python library implementing this spec. That includes pyarrow and nanoarrow objects today, and hopefully soon duckdb and polars objects as well (pola-rs/polars#12530).

Implementing the pycapsule interface on daft classes means that any of those other libraries would just work with daft objects.
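
A minimal sketch of such a duck-typed from_arrow (not Daft's actual code: it assumes pyarrow >= 15 for RecordBatchReader.from_stream and materializes through pyarrow, whereas a real implementation would hand the stream to Daft's native core):

```python
import pyarrow as pa

def from_arrow(obj) -> pa.Table:
    # Fast path: already a pyarrow Table.
    if isinstance(obj, pa.Table):
        return obj
    # Accept anything exporting the Arrow C stream interface
    # (pyarrow, nanoarrow, polars, duckdb, ...).
    if hasattr(obj, "__arrow_c_stream__"):
        return pa.RecordBatchReader.from_stream(obj).read_all()
    raise TypeError(f"expected an Arrow-compatible object, got {type(obj).__name__}")
```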

Describe the solution you'd like

Describe alternatives you've considered

Additional context

I've implemented import and export of the pycapsule interface for arrow objects in pyo3-arrow (in a separate crate for a few reasons).

It looks like daft is still tied to arrow2, but maybe pyo3-arrow is useful as a reference here.

@jaychia (Contributor) commented Jul 12, 2024

Very cool. I was just chatting with @paleolimbot about whether Daft could remove its dependency on PyArrow at some point and rely on something lighter-weight to do import/export of arrow data.

Today it's technically possible (we can have PyArrow as an optional dependency for the most part, except for some file-writing code where we still rely on PyArrow utilities), but requiring PyArrow anywhere on the critical path is definitely a heavy dependency.

If I understand correctly, there are a few issues to be fixed here:

  1. On the import side: fix from_arrow to check hasattr(obj, "__arrow_c_stream__") and use this API to import Arrow data when possible
  2. On the export side: add an __arrow_c_stream__ method as well, returning a stream of Arrow data instead of a fully materialized Arrow table (see the usage sketch after this comment)

Am I understanding this correctly?
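
With both pieces in place, consumers would need no daft-specific glue at all. A hypothetical usage sketch (assuming pyarrow >= 15 and a daft_table object that implements the export side):

```python
import pyarrow as pa

# `daft_table` is a hypothetical daft Table implementing __arrow_c_stream__;
# pyarrow builds a reader from it without knowing anything about daft.
reader = pa.RecordBatchReader.from_stream(daft_table)
materialized = reader.read_all()
```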

@paleolimbot commented

I think so!

An example of the import side:

https://github.com/apache/arrow-nanoarrow/blob/2aa2e697fd5d0ef3cd88961bb030f9e760db581b/python/src/nanoarrow/c_array_stream.py#L69-L74

On the export side, you probably want to emulate https://github.com/kylebarron/arro3/blob/227c505714865064389d626c7fe3d79e8bd315c0/pyo3-arrow/src/array.rs#L123-L141 (or something like it using arrow2 and the C stream interface, like maybe pola-rs/r-polars#5).
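
For a Python-level illustration of that export side, a sketch that simply delegates to pyarrow (pa.Table has exported __arrow_c_stream__ since pyarrow 14; to_arrow() is a hypothetical helper returning a pa.Table, and Daft's real implementation would build the stream capsule natively in Rust, as the arro3 example does):

```python
class Table:
    def __arrow_c_stream__(self, requested_schema=None):
        # `self.to_arrow()` is a hypothetical helper returning a pa.Table;
        # delegating reuses pyarrow's capsule export (pyarrow >= 14).
        return self.to_arrow().__arrow_c_stream__(requested_schema)
```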

I hope that helps!

@kylebarron (Author) commented

In case it's useful, I implemented the pycapsule interface for polars in pola-rs/polars#17676 and pola-rs/polars#17693. I figure polars-arrow hasn't diverged too much from arrow2, so it should be pretty similar to implement the pycapsule interface in daft.
