-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35289: [Python] Support large variable width types in numpy conversion #36701
base: main
Are you sure you want to change the base?
Conversation
|
Thank you for the contribution @Kimahriman ! The PR looks good! In the C++ part, where I am a novice also - but, I would try with creating a helper Status function and see if I can make it work with both builder instances or not (am not really sure it will work). As for the python test, it could also be consolidated with arrow/python/pyarrow/tests/test_array.py Line 635 in 9ecd0f2
|
In the Visit functions, I think the main (only?) difference is the type of |
Thanks for the ideas, I'll see if I can figure anything out 😅 |
Ok I learned about C++ templates and deduped the different functions, and I just parametrized the existing python test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! The PR looks good IMO +1.
@jorisvandenbossche could you give one more look before merging?
@jorisvandenbossche gentle ping |
@jorisvandenbossche gentle ping again. Merged in master and resolved the conflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
And apologies for the very slow follow-up ..
I have one more request: could you add a simple test for this as well (the large_memory test you edited it good to have, but in most our CI builds those are skipped, so it would be good to have a simple small test as well that is run everywhere).
For example there is a test_array_from_numpy_ascii
and test_array_from_numpy_unicode
. You could add one case to those tests with specifying the type as the large variant.
template <typename T> | ||
Status VisitString(T* builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nitpick, but could you move those declarations of the helpers into protected:
(eg just below VisitNative
definition)
(I don't think anyone outside of pyarrow is using this, but just to keep it consistent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the VisitBinary
and VisitString
declarations specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly as what you did
Parameterized those tests to specify the type and use regular and large types |
python/pyarrow/tests/test_array.py
Outdated
@@ -2355,32 +2355,33 @@ def test_array_from_numpy_timedelta_incorrect_unit(): | |||
pa.array(data) | |||
|
|||
|
|||
def test_array_from_numpy_ascii(): | |||
@pytest.mark.parametrize('binary_type', [pa.binary(), pa.large_binary()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, one last comment (promised! ;)): I think we want to keep testing the below also in the case of not specifying the type (in which case we infer the small types), and I am not entirely sure this is explicitly covered elsewhere (implicitly for sure). But so could parametrize this slightly differently with:
@pytest.mark.parametrize('binary_type', [pa.binary(), pa.large_binary()]) | |
@pytest.mark.parametrize('typ, expected_type', [(None, pa.binary()), (pa.binary(), pa.binary()), (pa.large_binary(), pa.large_binary())]) |
And then the same for string test below.
Or, maybe simpler, just duplicate the first case to have a version without a type specified:
# without specified type, always binary
arrow_arr = pa.array(arr)
assert arrow_arr.type == 'binary'
expected = ..
arrow_arr = pa.array(arr, binary_type)
assert arrow_arr.type == binary_type
expected = ..
(I assume that for the inference it shouldn't matter if there are strides or not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I thought about that as I was making the update hah, will add back the inferring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a third parameter and just set the expected type in the func
@pytest.mark.parametrize('string_type', [None, pa.utf8(), pa.large_utf8()])
def test_array_from_numpy_unicode(string_type):
# Default when no type is specified should be utf8
expected_type = string_type or pa.utf8()
Rationale for this change
Add support for LargeBinaryType and LargeStringType in NumPy to Arrow conversion.
What changes are included in this PR?
Adds new
Visit
methods inNumPyConverter
forLargeStringType
andLargeBinaryType
. These are mostly copy-pastes of the non-large methods, just without the chunking. Since the chunking is specifically for getting around the 2GiB limit of non-large variable width types, it didn't seem like it made sense to build a chunked builder for large types. I also had to create a copy ofAppendUTF32
. If there's a way to consolidate the common code let me know, this is my first C++ in a long time.Also adds the Arrow -> NumPy type map for the large binary types.
Are these changes tested?
New test added showing a string and binary array over 2 GiB is valid and still a single chunk. Also added the new Arrow to NumPy maps to a schema test.
Are there any user-facing changes?
Adds support for converting NumPy string/binary lists to large binary types.