Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35289: [Python] Support large variable width types in numpy conversion #36701

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Kimahriman
Copy link

@Kimahriman Kimahriman commented Jul 15, 2023

Rationale for this change

Add support for LargeBinaryType and LargeStringType in NumPy to Arrow conversion.

What changes are included in this PR?

Adds new Visit methods in NumPyConverter for LargeStringType and LargeBinaryType. These are mostly copy-pastes of the non-large methods, just without the chunking. Since the chunking is specifically for getting around the 2GiB limit of non-large variable width types, it didn't seem like it made sense to build a chunked builder for large types. I also had to create a copy of AppendUTF32. If there's a way to consolidate the common code let me know, this is my first C++ in a long time.

Also adds the Arrow -> NumPy type map for the large binary types.

Are these changes tested?

New test added showing a string and binary array over 2 GiB is valid and still a single chunk. Also added the new Arrow to NumPy maps to a schema test.

Are there any user-facing changes?

Adds support for converting NumPy string/binary lists to large binary types.

@github-actions
Copy link

⚠️ GitHub issue #35289 has been automatically assigned in GitHub to PR creator.

@AlenkaF
Copy link
Member

AlenkaF commented Aug 22, 2023

Thank you for the contribution @Kimahriman !

The PR looks good!
As you have mentioned, there might be some improvement possible with consolidating the common code.

In the C++ part, where I am a novice also - but, I would try with creating a helper Status function and see if I can make it work with both builder instances or not (am not really sure it will work).

As for the python test, it could also be consolidated with test_numpy_binary_overflow_to_chunked with parametrizing the test function as for example in

@pytest.mark.parametrize('list_type_factory', [pa.list_, pa.large_list])
.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 22, 2023
@jorisvandenbossche
Copy link
Member

If there's a way to consolidate the common code let me know, this is my first C++ in a long time.

In the Visit functions, I think the main (only?) difference is the type of builder that was created? In that case I think it should be possible to template this: have a single VisitString() that is templated on the builder type, and then the Visit(const StringType& type) could be a small wrapper around calling VisitString<ChunkedStringBuilder>(). Although the builder having different parameters to instantiate it might complicate things ..

@Kimahriman
Copy link
Author

Thanks for the ideas, I'll see if I can figure anything out 😅

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 23, 2023
@Kimahriman
Copy link
Author

Ok I learned about C++ templates and deduped the different functions, and I just parametrized the existing python test

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! The PR looks good IMO +1.
@jorisvandenbossche could you give one more look before merging?

@Kimahriman
Copy link
Author

@jorisvandenbossche gentle ping

@Kimahriman
Copy link
Author

@jorisvandenbossche gentle ping again. Merged in master and resolved the conflict

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

And apologies for the very slow follow-up ..

I have one more request: could you add a simple test for this as well (the large_memory test you edited it good to have, but in most our CI builds those are skipped, so it would be good to have a simple small test as well that is run everywhere).
For example there is a test_array_from_numpy_ascii and test_array_from_numpy_unicode. You could add one case to those tests with specifying the type as the large variant.

Comment on lines 232 to 233
template <typename T>
Status VisitString(T* builder);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick, but could you move those declarations of the helpers into protected: (eg just below VisitNative definition)

(I don't think anyone outside of pyarrow is using this, but just to keep it consistent)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the VisitBinary and VisitString declarations specifically?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly as what you did

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 22, 2024
@Kimahriman
Copy link
Author

Looks good!

And apologies for the very slow follow-up ..

I have one more request: could you add a simple test for this as well (the large_memory test you edited it good to have, but in most our CI builds those are skipped, so it would be good to have a simple small test as well that is run everywhere). For example there is a test_array_from_numpy_ascii and test_array_from_numpy_unicode. You could add one case to those tests with specifying the type as the large variant.

Parameterized those tests to specify the type and use regular and large types

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 22, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 23, 2024
@@ -2355,32 +2355,33 @@ def test_array_from_numpy_timedelta_incorrect_unit():
pa.array(data)


def test_array_from_numpy_ascii():
@pytest.mark.parametrize('binary_type', [pa.binary(), pa.large_binary()])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, one last comment (promised! ;)): I think we want to keep testing the below also in the case of not specifying the type (in which case we infer the small types), and I am not entirely sure this is explicitly covered elsewhere (implicitly for sure). But so could parametrize this slightly differently with:

Suggested change
@pytest.mark.parametrize('binary_type', [pa.binary(), pa.large_binary()])
@pytest.mark.parametrize('typ, expected_type', [(None, pa.binary()), (pa.binary(), pa.binary()), (pa.large_binary(), pa.large_binary())])

And then the same for string test below.

Or, maybe simpler, just duplicate the first case to have a version without a type specified:

    # without specified type, always binary
    arrow_arr = pa.array(arr)
    assert arrow_arr.type == 'binary'
    expected = ..

    arrow_arr = pa.array(arr, binary_type)
    assert arrow_arr.type == binary_type
    expected = ..

(I assume that for the inference it shouldn't matter if there are strides or not)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought about that as I was making the update hah, will add back the inferring

Copy link
Author

@Kimahriman Kimahriman May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a third parameter and just set the expected type in the func

@pytest.mark.parametrize('string_type', [None, pa.utf8(), pa.large_utf8()])
def test_array_from_numpy_unicode(string_type):
    # Default when no type is specified should be utf8
    expected_type = string_type or pa.utf8()

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Converting from NumPy to large_string or large_binary returns not implemented
3 participants