GH-35289: [Python] Support large variable width types in numpy conversion #36701

Kimahriman · 2023-07-15T12:18:28Z

Rationale for this change

Add support for LargeBinaryType and LargeStringType in NumPy to Arrow conversion.

What changes are included in this PR?

Adds new Visit methods in NumPyConverter for LargeStringType and LargeBinaryType. These are mostly copy-pastes of the non-large methods, just without the chunking. Since the chunking is specifically for getting around the 2GiB limit of non-large variable width types, it didn't seem like it made sense to build a chunked builder for large types. I also had to create a copy of AppendUTF32. If there's a way to consolidate the common code let me know, this is my first C++ in a long time.

Also adds the Arrow -> NumPy type map for the large binary types.
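The 2 GiB limit mentioned above follows from the signed 32-bit offsets used by the non-large string and binary types; the large variants use 64-bit offsets. As a rough illustration of why one case needs chunking and the other does not, here is a minimal pure-Python sketch. It is not Arrow's chunked builder: `split_into_chunks`, `INT32_MAX_BYTES`, and the greedy grouping strategy are assumptions made up for this example.

```python
# Illustrative sketch only; this is NOT how Arrow's chunked builder is written.
from typing import List

# Assumed limit for illustration: largest byte count a signed 32-bit offset can address.
INT32_MAX_BYTES = 2**31 - 1


def split_into_chunks(
    value_lengths: List[int], max_chunk_bytes: int = INT32_MAX_BYTES
) -> List[List[int]]:
    """Greedily group value lengths so each chunk's data fits in max_chunk_bytes."""
    chunks: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for length in value_lengths:
        if length > max_chunk_bytes:
            raise ValueError("a single value exceeds the per-chunk limit")
        # Start a new chunk when appending would overflow the offset range.
        if current and current_bytes + length > max_chunk_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(length)
        current_bytes += length
    if current:
        chunks.append(current)
    return chunks


# Five 512 MiB values (2.5 GiB in total).
half_gib = 2**29
small = split_into_chunks([half_gib] * 5)                             # 32-bit offsets
large = split_into_chunks([half_gib] * 5, max_chunk_bytes=2**63 - 1)  # 64-bit offsets
print([len(c) for c in small], [len(c) for c in large])
```

With 32-bit offsets the data has to be split, while with 64-bit offsets everything fits in a single chunk, which is why a chunked builder did not seem necessary for the large types.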

Are these changes tested?

New test added showing a string and binary array over 2 GiB is valid and still a single chunk. Also added the new Arrow to NumPy maps to a schema test.

Are there any user-facing changes?

Adds support for converting NumPy string/binary lists to large binary types.

Closes: [Python] Converting from NumPy to large_string or large_binary returns not implemented #35289

github-actions · 2023-07-15T12:18:55Z

⚠️ GitHub issue #35289 has been automatically assigned in GitHub to PR creator.

AlenkaF · 2023-08-22T05:47:19Z

Thank you for the contribution @Kimahriman !

The PR looks good!
As you have mentioned, there might be some improvement possible with consolidating the common code.

On the C++ side (where I am also a novice), I would try creating a helper function that returns Status and see whether it can work with both builder instances. I am not really sure it will.

As for the Python test, it could also be consolidated with test_numpy_binary_overflow_to_chunked by parametrizing the test function, as is done for example in arrow/python/pyarrow/tests/test_array.py (line 635 in 9ecd0f2):

@pytest.mark.parametrize('list_type_factory', [pa.list_, pa.large_list])

python/pyarrow/src/arrow/python/numpy_to_arrow.cc

jorisvandenbossche · 2023-08-22T09:28:21Z

> If there's a way to consolidate the common code let me know, this is my first C++ in a long time.

In the Visit functions, I think the main (only?) difference is the type of builder that was created? In that case I think it should be possible to template this: have a single VisitString() that is templated on the builder type, and then the Visit(const StringType& type) could be a small wrapper around calling VisitString<ChunkedStringBuilder>(). Although the builder having different parameters to instantiate it might complicate things ..

Kimahriman · 2023-08-22T10:56:56Z

Thanks for the ideas, I'll see if I can figure anything out 😅

Kimahriman · 2023-08-23T16:28:18Z

Ok I learned about C++ templates and deduped the different functions, and I just parametrized the existing python test

AlenkaF

Great work! The PR looks good IMO +1.
@jorisvandenbossche could you give one more look before merging?

Kimahriman · 2023-11-16T18:36:15Z

@jorisvandenbossche gentle ping

Kimahriman · 2024-05-21T15:35:58Z

@jorisvandenbossche gentle ping again. Merged in master and resolved the conflict

jorisvandenbossche

Looks good!

And apologies for the very slow follow-up ..

I have one more request: could you add a simple test for this as well? The large_memory test you edited is good to have, but those are skipped in most of our CI builds, so it would be good to also have a simple small test that runs everywhere.
For example there is a test_array_from_numpy_ascii and test_array_from_numpy_unicode. You could add one case to those tests with specifying the type as the large variant.

jorisvandenbossche · 2024-05-22T12:30:43Z

python/pyarrow/src/arrow/python/numpy_to_arrow.cc

+  template <typename T>
+  Status VisitString(T* builder);


Small nitpick, but could you move those helper declarations into the protected: section (e.g. just below the VisitNative definition)?

(I don't think anyone outside of pyarrow is using this, but just to keep it consistent)

Kimahriman · 2024-05-22

Just the VisitBinary and VisitString declarations specifically?

jorisvandenbossche · 2024-05-23

Yes, exactly what you did.

Kimahriman · 2024-05-22T17:31:35Z

> Looks good!
>
> And apologies for the very slow follow-up ..
>
> I have one more request: could you add a simple test for this as well? The large_memory test you edited is good to have, but those are skipped in most of our CI builds, so it would be good to also have a simple small test that runs everywhere. For example there is a test_array_from_numpy_ascii and test_array_from_numpy_unicode. You could add one case to those tests with specifying the type as the large variant.

Parameterized those tests to specify the type and use regular and large types

jorisvandenbossche · 2024-05-23T08:05:08Z

python/pyarrow/tests/test_array.py

@@ -2355,32 +2355,33 @@ def test_array_from_numpy_timedelta_incorrect_unit():
            pa.array(data)


-def test_array_from_numpy_ascii():
+@pytest.mark.parametrize('binary_type', [pa.binary(), pa.large_binary()])


OK, one last comment (promised! ;)): I think we want to keep testing the below also when no type is specified (in which case we infer the small types), and I am not entirely sure this is explicitly covered elsewhere (implicitly for sure). So you could parametrize this slightly differently with:

Suggested change

-@pytest.mark.parametrize('binary_type', [pa.binary(), pa.large_binary()])
+@pytest.mark.parametrize('typ, expected_type', [(None, pa.binary()), (pa.binary(), pa.binary()), (pa.large_binary(), pa.large_binary())])

And then the same for string test below.

Or, maybe simpler, just duplicate the first case to have a version without a type specified:

    # without specified type, always binary
    arrow_arr = pa.array(arr)
    assert arrow_arr.type == 'binary'
    expected = ..

    arrow_arr = pa.array(arr, binary_type)
    assert arrow_arr.type == binary_type
    expected = ..

(I assume that for the inference it shouldn't matter if there are strides or not)

Kimahriman · 2024-05-23

Yeah, I thought about that as I was making the update, hah. Will add back the inferring.

Kimahriman · 2024-05-23

Added a third parameter and just set the expected type in the func:

    @pytest.mark.parametrize('string_type', [None, pa.utf8(), pa.large_utf8()])
    def test_array_from_numpy_unicode(string_type):
        # Default when no type is specified should be utf8
        expected_type = string_type or pa.utf8()

Kimahriman added 2 commits July 14, 2023 08:15

Support large binary types in numpy conversion

64de142

Add large types to schema test

348be4e

github-actions bot added the "Component: Python" and "awaiting review" labels Jul 15, 2023

Kimahriman mentioned this pull request Jul 15, 2023

[SPARK-39979][SQL][FOLLOW-UP] Support large variable types in pandas UDF, createDataFrame and toPandas with Arrow apache/spark#41569

Closed

pitrou requested review from AlenkaF and jorisvandenbossche July 19, 2023 07:53

jorisvandenbossche reviewed Aug 22, 2023

github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Aug 22, 2023

Template functions to dedupe and parametrize python test

5248701

github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Aug 23, 2023

Fix test lint

51cc103

AlenkaF approved these changes Aug 24, 2023

Merge branch 'main' into nptoarrow-large-binary

6c8c425

jorisvandenbossche reviewed May 22, 2024

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label May 22, 2024

Changed visibility of templated methods and parameterize quick tests

27c5228

github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label May 22, 2024

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label May 23, 2024

jorisvandenbossche reviewed May 23, 2024

Add back inferring to test

0149903

github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label May 23, 2024
