Forced rechunking #32

aldanor · 2023-10-06T00:22:03Z

This took me a while to figure out, since this was the last place I'd expect a forced rechunk to happen. While passing huge frames from Python to Rust and back, I noticed that they end up arriving in one chunk even if they were multi-chunked originally.

Is there any reason to not leave rechunking to the end-user? (since in some cases it may end up being very detrimental)

pyo3-polars/pyo3-polars/src/lib.rs

Line 121 in 0165cb4

let ob = ob.call_method0("rechunk")?;

... and also this:

pyo3-polars/pyo3-polars/src/lib.rs

Line 163 in 0165cb4

let s = self.0.rechunk();

aldanor · 2023-10-06T01:23:01Z

Ok, edit, after a bit of reading...

IntoPy: basically, creates a pa.Array via pa.Array._import_from_c. In this case, if we have multiple chunks, can we simply do that for each chunk and then call pa.chunked_array(chunks)?

Problem is though, pl.from_arrow() seems to squash pa.ChunkedArray somewhere along the way anyways...

FromPyObject: more important but a bit more obscure:

- The implementation relies on Series.to_arrow(), which always returns a contiguous result
  - Because the first line in PySeries::to_arrow() calls self.rechunk(true)
- DataFrame.to_arrow(), however, returns a chunked Table (!)
  - But FromPyObject for PyDataFrame doesn't/can't use it and instead goes series by series, so you end up with rechunked columns anyway
  - If there's more than one chunk, can the dataframe be reconstructed as-is from record batches? (A similar conversion is already done in the other direction in arrow_interop.)
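
The record-batch idea in the last point can be sketched in plain Python. Each batch is modeled here as a dict of column name to list, and `columns_from_batches` is a hypothetical name used only for illustration; real code would operate on Arrow record batches instead:

```python
def columns_from_batches(batches):
    """Hypothetical helper: regroup row-wise batches into per-column chunk lists."""
    columns = {}
    for batch in batches:
        for name, values in batch.items():
            # Append each batch's slice as its own chunk; nothing is concatenated.
            columns.setdefault(name, []).append(values)
    return columns


# Two batches standing in for a two-chunk dataframe (made-up data).
batches = [
    {"a": [1, 2], "b": ["x", "y"]},
    {"a": [3], "b": ["z"]},
]
columns = columns_from_batches(batches)
```

Each column ends up with one chunk per batch, so the original chunk layout is preserved instead of being flattened.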

ritchie46 · 2023-10-07T06:04:05Z

We should return ChunkedArrays to arrow. That we don't was probably a bit of laziness when I implemented this.

aldanor · 2023-10-09T01:33:22Z

> We should return ChunkedArrays to arrow.

To arrow or from arrow? 🙂 (i.e. IntoPy or FromPyObject?)

If I'm reading it right, btw, the chunked-array API is not part of Arrow's stable C API; is that part of the problem here?

// Yea, in some cases, this kind of rechunking may be catastrophic, e.g. if your dataframes are 50-100 GB, rechunk is the last thing you want to happen behind the scenes...

ritchie46 · 2023-10-09T13:19:18Z

But we could return a list of arrow arrays. 🤔 And then even use that to create a pyarrow ChunkedArray.

aldanor · 2023-10-09T14:44:37Z

Yea, I think that should work.

There's also the question of whether the single-chunk case should be special-cased: should it yield a list of one and produce a chunked array with a single batch, or a plain array?

ritchie46 · 2023-10-17T05:26:48Z

I think we can add a rechunk parameter and return an array if rechunk=True and otherwise always a list of arrays.
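
A minimal pure-Python sketch of the semantics proposed here, with chunks modeled as plain lists rather than Arrow arrays. The name `export_chunks` and its signature are hypothetical and not part of pyo3-polars:

```python
from itertools import chain


def export_chunks(chunks, rechunk=False):
    """Hypothetical helper: `chunks` is a list of lists standing in for Arrow arrays."""
    if rechunk:
        # Opt-in: copy every value into one contiguous array.
        return list(chain.from_iterable(chunks))
    # Default: always a list of arrays, even for a single chunk;
    # the chunks themselves are passed through without copying.
    return list(chunks)


chunks = [[1, 2], [3, 4, 5]]
as_list = export_chunks(chunks)                 # list of two arrays
as_array = export_chunks(chunks, rechunk=True)  # one contiguous array
single = export_chunks([[1, 2]])                # still a list, of length one
```

With this shape the return type depends only on `rechunk`, not on how many chunks the data happens to have, which sidesteps the single-chunk special case raised earlier in the thread.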

aldanor · 2023-10-17T22:34:49Z

That sounds reasonable. The default being no rechunking?

ritchie46 · 2023-10-18T07:57:01Z

> That sounds reasonable. The default being no rechunking?

Yes. Default to not exploding your memory. 🙈
