[Python] Converting data frame to Table with large nested column fails Invalid Struct child array has length smaller than expected
#32439
Comments
Joris Van den Bossche / @jorisvandenbossche: |
hadim:
Is there a workaround while waiting for a fix? |
Joris Van den Bossche / @jorisvandenbossche:
import pyarrow as pa
N = 100_000  # chunk size (rows per record batch)
pa.Table.from_batches([pa.record_batch(df.iloc[i:i+N]) for i in range(0, len(df), N)]) |
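For anyone reusing the workaround above, the slicing logic can be factored into a small helper. This is a minimal sketch in plain Python: `chunk_bounds` is a hypothetical name, not a pyarrow function, and the chunk size still has to be small enough that each chunk converts without error.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


bounds = list(chunk_bounds(250_000, 100_000))
print(bounds)  # [(0, 100000), (100000, 200000), (200000, 250000)]
```

Each `(start, stop)` pair corresponds to one batch of the workaround, e.g. `pa.record_batch(df.iloc[start:stop])`, and the resulting list of batches is what gets passed to `pa.Table.from_batches`.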
The same error appears when converting a List of Dicts in Python (JSON data) to a
I will test the workaround now. Is there any possibility that this will be fixed anytime soon? The workaround in my case is quite similar:
pa.Table.from_batches(
    batches=[
        pa.record_batch(data=messages[i : i + chunk_size], schema=schema)  # noqa E203
        for i in range(0, len(messages), chunk_size)
    ],
    schema=schema,
) |
The issue is going to happen any time a single string column ends up with more than 2^31 characters. So, in the OP's reproduction the column
This "breaking unexpectedly large columns into chunks" behavior is rather tricky, and it appears we are doing something wrong when working with lists of struct arrays. Here's a compact reproducer (that only has 3 rows):
It seems the struct array has length 3. Meanwhile, its child, the string array, has length 2, because it had to be broken into 2 chunks: the first chunk has the first 2 values and the second chunk has the third. So if someone wants to investigate this, I would recommend starting with the conversion-from-pandas code and looking at how the struct and list arrays handle the case where their children are converted into multiple chunks. |
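The chunking described above can be pictured with a toy model in plain Python. This is only a sketch of the idea, not pyarrow's converter: `split_by_capacity` and the tiny `capacity` value are made up for illustration, whereas the real limit for a string column is on the order of 2^31 bytes of character data per chunk.

```python
def split_by_capacity(values, capacity):
    """Greedily pack strings into chunks holding at most `capacity` bytes each."""
    chunks, current, used = [], [], 0
    for value in values:
        size = len(value.encode("utf-8"))
        if size > capacity:
            raise ValueError("a single value does not fit in one chunk")
        if used + size > capacity:
            # The current chunk is full: finish it and start a new one.
            chunks.append(current)
            current, used = [], 0
        current.append(value)
        used += size
    if current:
        chunks.append(current)
    return chunks


# Three equally sized values, with room for only two per chunk.
chunks = split_by_capacity(["aaaa", "bbbb", "cccc"], capacity=10)
print([len(c) for c in chunks])  # [2, 1]
```

A struct that wraps such a column has to be split at the same row boundaries. The error in this issue corresponds to the struct keeping length 3 while its first child chunk holds only 2 values.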
👀 Any updates for this issue? |
The workaround works, but it definitely slows things down |
I think the root cause is here:
In the blanket implementation of
For e.g.
The resulting
I did a janky sanity test to make
I'm guessing this will be the same for all nested types. Hence, we get the validation error which probably isn't expecting
I'm not sure what the best path forward is here. It's not clear to me:
@westonpace can you advise if this sounds like the right place, and what next steps should look like? EDIT: Actually I think I can make it work by overriding |
If memory serves, this is correct and intended behavior (I am not quite 100% certain of this paragraph, but "fairly confident" :) maybe someone else will confirm). The length of a parent array may be shorter than the length of a child array. Offsets are cumulative, so if a parent array has offset 10 and a child array has offset 5, then the memory location that is looked at for the 0th value will be the 15th slot.
I'm not very familiar with the python converters. However, I think we have a good amount of regression tests. If you can get them to pass while passing this case then you are probably on the right track. |
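The offset arithmetic in the comment above can be written out as a short sketch. It uses a plain Python list rather than Arrow buffers, and `physical_slot` is a hypothetical helper; it illustrates only the cumulative-offset rule as the commenter describes it, which they themselves flag as not fully certain.

```python
def physical_slot(parent_offset, child_offset, logical_index):
    """Slot in the child's underlying buffer for a logical index of the parent."""
    return parent_offset + child_offset + logical_index


buffer = [f"v{i}" for i in range(20)]  # stand-in for the child's value buffer
slot = physical_slot(parent_offset=10, child_offset=5, logical_index=0)
print(slot, buffer[slot])  # 15 v15
```

Under this rule the parent only addresses a window of the child, so a child that is longer than the parent is fine, while a child too short to cover that window is what validation rejects.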
…37376)

### Rationale for this change

See: #32439

### What changes are included in this PR?

During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children.

This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other.

This is per: https://github.com/apache/arrow/blob/86b7a84c9317fa08222eb63f6930bbb54c2e6d0b/cpp/src/arrow/array/builder_nested.h#L507-L508

### Are these changes tested?

A unit test is added that would previously have an invalid data error.

```
>       tab = pa.Table.from_pandas(df)

pyarrow/tests/test_pandas.py:4970:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas
    return cls.from_arrays(arrays, schema=schema)
pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays
    result.validate()
pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate
    check_status(self.table.Validate())
# ...
FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3)
```

NOTE: This unit test uses about 7GB of memory (max RSS) on my macbook pro. This might make CI challenging; I'm open to suggestions to limit it.

### Are there any user-facing changes?

No

* Closes: #32439

Lead-authored-by: Mike Lui <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
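The ordering change described in the commit message can be illustrated with a toy model in plain Python. None of these classes are Arrow's builders: `ToyStringBuilder`, `ToyStructBuilder`, and the small `capacity` are hypothetical stand-ins, meant only to show why appending the child before the struct keeps the two lengths consistent when a capacity error interrupts the append.

```python
class CapacityError(Exception):
    """Raised when a child builder cannot hold another value."""


class ToyStringBuilder:
    def __init__(self, capacity):
        self.capacity = capacity
        self.values = []
        self.used = 0

    def append(self, value):
        if self.used + len(value) > self.capacity:
            raise CapacityError(value)
        self.values.append(value)
        self.used += len(value)


class ToyStructBuilder:
    def __init__(self, child, child_first):
        self.child = child
        self.child_first = child_first
        self.length = 0

    def append(self, value):
        if self.child_first:
            self.child.append(value)  # may raise before the struct grows
            self.length += 1
        else:
            self.length += 1          # struct grows even if the child fails
            self.child.append(value)


def fill_until_full(child_first):
    """Append three values; return (struct length, child length) at the error."""
    builder = ToyStructBuilder(ToyStringBuilder(capacity=10), child_first)
    try:
        for value in ["aaaa", "bbbb", "cccc"]:
            builder.append(value)
    except CapacityError:
        pass  # a real converter would finish this chunk and start a new one
    return builder.length, len(builder.child.values)


print(fill_until_full(child_first=False))  # (3, 2): struct is ahead of its child
print(fill_until_full(child_first=True))   # (2, 2): lengths stay consistent
```

The `(3, 2)` result mirrors the `(2 < 3)` in the validation error, and the child-first ordering leaves a chunk that can be finished as-is.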
Hey,
I have a data frame in which one column is a nested struct array. Converting it to a pyarrow.Table fails if the data frame gets too big. I could reproduce the bug with a minimal example using anonymized data that is roughly similar to mine. When I set, e.g., N_ROWS=500_000 or smaller, it works fine.
The length is always smaller than expected by 1.
Expected behavior:
Run without errors or fail with a better error message.
System Info and Versions:
Apple M1 Pro but also happened on amd64 Linux machine on AWS
I could also reproduce with
Reporter: Simon Weiß
Related issues:
Invalid Struct child array has length smaller than expected (is cloned by)
Note: This issue was originally created as ARROW-17137. Please see the migration documentation for further details.