-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add user documentation for the FFI approach #1031
Conversation
Fantastic! No edits from me. I learned more about FFI and the approach reading that too. That is the exact doc I would have needed and wanted when encountering this issue organically |
Thinking out loud. Do we want to have a FAQ of some sort in the top level Unsure what other things you could touch on with the FAQ but it could be useful I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Great write up.
@kevinjqliu Since you've been working with this code closely recently, would you mind doing a review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for writing this up! This is a wealth of information 🙏
I added a few comments. Overall, this is amazing.
The Primary Issue | ||
----------------- | ||
|
||
Suppose you wish to use DataFusion and you have a custom data source that can produce tables that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a callout, whats so special about this method is that the abstraction is at the "data provider" ("TableProvider") layer. Most other projects are integrated with the Arrow ecosystem but requires passing arrow table or arrow batches around.
This here allows datafusion to integrate with a data source and produce the necessary data at the root level, which allows further optimizations and custom handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example, currently pyiceberg pass data to other engines by materializing the arrow table first and then registering the table with the engine
https://github.com/apache/iceberg-python/blob/b95e792d86f551d3705c3ea6b7e9985a2a0aaf3b/pyiceberg/table/__init__.py#L1785-L1828
as performant as possible and to utilize the features of DataFusion, you may decide to write | ||
your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library. | ||
|
||
At first glance, it may appear the best way to do this is to add the ``datafusion-python`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the datafusion-python
crate is interesting. what is it used for outside of creating the python bindings as the datafusion
python library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we started doing the FFI work, re-exporting the datafusion-python
crate was basically they only way you could add extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! i see this is called out in the Alternative Approach
section. Was just wondering if there was another reason I missed
.. code-block:: rust | ||
|
||
let my_provider = MyTableProvider::default(); | ||
let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since this is the implementation details section, i think its worth calling out the ability to passing tokio runtime and why we would want to do that
apache/datafusion#13937
The FFI Approach | ||
---------------- | ||
|
||
Rust supports interacting with other programming languages through it's Foreign Function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's
should be its
In addition to using these interfaces to transfer Arrow data between libraries, ``pyarrow`` | ||
goes one step further to make sharing the interfaces easier in Python. They do this | ||
by exposing PyCapsules that contain the expected functionality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to note that this isn't just a pyarrow implementation detail; it's a full specification for Python, of which pyarrow is just one implementor https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular, pyarrow didn't used to use capsules; they use to use raw integers in the _export_to_c
and _import_from_c
methods, where those integers refer to data pointers.
But this can be unsafe/lead to memory leaks if those pointers aren't always consumed, because then the Arrow release callback may never be called.
The added benefit and safety of the capsule is specifically because the capsule allows you to define a destructor, so that the destructor will be called by the Python GC if the capsule is never used. So you can't get memory leaks when using capsules.
let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None); | ||
|
||
If you were interfacing with a library that provided the above ``FFI_TableProvider`` and | ||
you needed to turn it back into an ``TableProvider``, you can turn it into a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
into an
should be into a
.. code-block:: rust | ||
|
||
let name = CString::new("datafusion_table_provider")?; | ||
let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you really want to use PyCapsule::new_with_destructor
so that you don't get memory leaks if the capsule isn't used.
Or maybe PyCapsule::new
will automatically call the Drop
impl and you don't need to manually call new_with_destructor
? I can't remember
.. code-block:: rust | ||
|
||
let capsule = capsule.downcast::<PyCapsule>()?; | ||
let provider = unsafe { capsule.reference::<FFI_TableProvider>() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO you should always check the name of the capsule before opening it. https://github.com/kylebarron/arro3/blob/a041de6bae703a2c8fa3e3b31cd863d4801101d2/pyo3-arrow/src/ffi/from_python/utils.rs#L48
Your FFI interface should also define a name that all implementations should attach to the pycapsule.
As we discussed, this is not guaranteed to work across different compiler versions and | ||
optimization levels. If you wish to go down this route, there are two approaches we | ||
have identified you can use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO I wouldn't even suggest this because this is explicitly recommended against by the pyo3 maintainers: PyO3/pyo3#1444
Which issue does this PR close?
Closes #1027
Rationale for this change
User requested.
What changes are included in this PR?
Adds a page to the online documentation.
Are there any user-facing changes?
None