-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-32439: [Python] Fix off by one bug when chunking nested structs #37376
Conversation
|
69e91f1
to
40ed108
Compare
Failures are due to high memory from the unit test 😮💨 |
Lowered the size of the binary array to be within mem limits (barely scraping by 🥹) 720,000,000 * 3 = 2,160,000,000, which triggers chunking over 2,147,483,647 edit: aaarg Python 3.8 w/ Pandas 1.0 works, while Python 3.10 with Pandas latest hits mem limits. This test is specifically for cases with high memory and chunking requirements. I'm open to other ideas for testing |
EDIT: Realized I can mark tests as large memory. The test that exposed this is: arrow/python/pyarrow/tests/test_convert_builtin.py Lines 2293 to 2319 in c9bfca5
|
I’m hitting this error too, any chance this PR could be merged? |
@westonpace since you seemed to be PoC on the GH issue, can you take a look or direct us to the right person to review this? It seems there's a bit of a backlog in reviewing [Python] PR's, anyway. |
@AlenkaF @wjones127 There are some backlog PRs is there any chance of getting this reviewed? |
Sorry for slow response from us @mikelui , thank you for keeping it going! PS: The mark for the tests with large memory is the correct way to go 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
The solution is elegant IMO. The tests added are also giving good coverage. I would only have another sanity check of the C++ change from somebody else and then I am happy to merge!
cc @westonpace @wjones127 @mapleFU @pitrou can someone take a look here? 🥹 |
During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children. This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed small test improvements and rebased from latest git main. Thanks for this @mikelui !
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 8cdce28. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…cts (apache#37376) ### Rationale for this change See: apache#32439 ### What changes are included in this PR? During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children. This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other. This is per: https://github.com/apache/arrow/blob/86b7a84c9317fa08222eb63f6930bbb54c2e6d0b/cpp/src/arrow/array/builder_nested.h#L507-L508 ### Are these changes tested? A unit test is added that would previously have an invalid data error. ``` > tab = pa.Table.from_pandas(df) pyarrow/tests/test_pandas.py:4970: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas return cls.from_arrays(arrays, schema=schema) pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays result.validate() pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate check_status(self.table.Validate()) # ... FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3) ``` NOTE: This unit test uses about 7GB of memory (max RSS) on my macbook pro. This might make CI challenging; I'm open to suggestions to limit it. ### Are there any user-facing changes? No * Closes: apache#32439 Lead-authored-by: Mike Lui <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…cts (apache#37376) ### Rationale for this change See: apache#32439 ### What changes are included in this PR? During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children. This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other. This is per: https://github.com/apache/arrow/blob/86b7a84c9317fa08222eb63f6930bbb54c2e6d0b/cpp/src/arrow/array/builder_nested.h#L507-L508 ### Are these changes tested? A unit test is added that would previously have an invalid data error. ``` > tab = pa.Table.from_pandas(df) pyarrow/tests/test_pandas.py:4970: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas return cls.from_arrays(arrays, schema=schema) pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays result.validate() pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate check_status(self.table.Validate()) # ... FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3) ``` NOTE: This unit test uses about 7GB of memory (max RSS) on my macbook pro. This might make CI challenging; I'm open to suggestions to limit it. ### Are there any user-facing changes? No * Closes: apache#32439 Lead-authored-by: Mike Lui <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…cts (apache#37376) ### Rationale for this change See: apache#32439 ### What changes are included in this PR? During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children. This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other. This is per: https://github.com/apache/arrow/blob/86b7a84c9317fa08222eb63f6930bbb54c2e6d0b/cpp/src/arrow/array/builder_nested.h#L507-L508 ### Are these changes tested? A unit test is added that would previously have an invalid data error. ``` > tab = pa.Table.from_pandas(df) pyarrow/tests/test_pandas.py:4970: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas return cls.from_arrays(arrays, schema=schema) pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays result.validate() pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate check_status(self.table.Validate()) # ... FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3) ``` NOTE: This unit test uses about 7GB of memory (max RSS) on my macbook pro. This might make CI challenging; I'm open to suggestions to limit it. ### Are there any user-facing changes? No * Closes: apache#32439 Lead-authored-by: Mike Lui <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
See: #32439
What changes are included in this PR?
During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children.
This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other.
This is per:
arrow/cpp/src/arrow/array/builder_nested.h
Lines 507 to 508 in 86b7a84
Are these changes tested?
A unit test is added that would previously have an invalid data error.
NOTE: This unit test uses about 7GB of memory (max RSS) on my macbook pro. This might make CI challenging; I'm open to suggestions to limit it.
Are there any user-facing changes?
No
Invalid Struct child array has length smaller than expected
#32439