feat: Support for Arrow PyCapsule interface #23
```diff
@@ -6,13 +6,16 @@
 import anywidget
 import duckdb
+import pyarrow as pa
 import traitlets
 
 from ._util import (
     arrow_table_from_dataframe_protocol,
     arrow_table_from_ipc,
     get_columns,
+    has_pycapsule_stream_interface,
     is_arrow_ipc,
+    is_dataframe_api_obj,
     table_to_ipc,
 )
```
```diff
@@ -37,10 +40,22 @@ def __init__(self, data, *, table: str = "df"):
             conn = data
         else:
             conn = duckdb.connect(":memory:")
-            if is_arrow_ipc(data):
+            if has_pycapsule_stream_interface(data):
+                # NOTE: for now we materialize the input into an in-memory Arrow table,
+                # so that we can perform repeated queries on that. In the future, it may
+                # be better to keep this Arrow stream non-materialized in Python and
+                # create a new DuckDB table from the stream.
+                # arrow_table = pa.RecordBatchReader.from_stream(data)
+                arrow_table = pa.table(data)
```
Review thread on lines +48 to +49:

Reviewer: This is not ideal: if I uncomment the `pa.RecordBatchReader.from_stream(data)` line, the widget shows 11k rows in the bottom right but then doesn't display anything. So I think you need to persist the Arrow stream input manually.

Author: Yeah, I figured this is not ideal... I just wasn't sure how to wire up DuckDB to query something consistently. I'm guessing something like `CREATE VIEW` over the input stream might allow us to register tables that DuckDB can continuously query.

Reviewer: I'm not 100% certain, but I don't think you could always use a view here. I think we're running into the same issue as in ibis: ibis-project/ibis#9663 (comment). In particular, the presence of the dunder method doesn't tell you whether the input data is already materialized or a stream. If you know that the input data is an in-memory table, then you can call `__arrow_c_stream__` multiple times.
```diff
+            elif is_arrow_ipc(data):
                 arrow_table = arrow_table_from_ipc(data)
-            else:
+            elif is_dataframe_api_obj(data):
                 arrow_table = arrow_table_from_dataframe_protocol(data)
+            else:
+                raise ValueError(
+                    "input must be a DuckDB connection, DataFrame-like, an Arrow IPC "
+                    "table, or an Arrow object exporting the Arrow C Stream interface."
+                )
             conn.register(table, arrow_table)
         self._conn = conn
         super().__init__(
```
Review thread on `duckdb.connect(":memory:")`:

Reviewer: Do you know if the DuckDB `:memory:` database will spill to disk? Or for very large input, would it just crash the process?

Reply: I believe it spills to disk.

Reviewer: I thought I was recently talking to someone who asserted that `:memory:` in particular didn't spill to disk, because it doesn't have a path to store a local database. But I'm not sure.

Reply: Hi, just stopping by: in-memory databases will spill to disk. DuckDB will create/use a `.tmp` folder in the current working directory; a database file is not required for spilling. You can test this by setting a relatively low memory limit and creating a temp table that exceeds it. `SELECT * FROM duckdb_temporary_files()` will show which temporary files were created to back that temporary table.