Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44393: [C++][Compute] Swizzle vector functions #44394

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

zanmato1984
Copy link
Contributor

@zanmato1984 zanmato1984 commented Oct 13, 2024

Rationale for this change

For background please see #44393.

When implementing the "scatter" function requested in #44393, I found it also useful to make it a public vector API. After a painful thinking, I decided to name it "permute". And when implementing permute, I found it fairly easy to implement it by first computing the "reverse indices" of the positions, and then invoking the existing "take", where I think "reverse_indices" itself can also be a useful public vector API. Thus the PR categorized them as "placement functions".

What changes are included in this PR?

Implement placement API reverse_indices and permute, where permute(values, indices) is implemented as take(values, reverse_indices(indices)).

I also put a small UT Permute.IfElse demonstrating how permute can serve as a building block of implementing special forms.

Are these changes tested?

UT included.

Are there any user-facing changes?

Yes, new public APIs added. Documents updated.

Copy link

⚠️ GitHub issue #44393 has been automatically assigned in GitHub to PR creator.

@zanmato1984
Copy link
Contributor Author

Hi, @felipecrv @pitrou . Could you help to review this? I think this is a necessary building block for special forms (#41094 (comment)). Much appreciated!

@zanmato1984 zanmato1984 changed the title GH-44393: Placement vector functions GH-44393: [C++][Compute] Placement vector functions Oct 13, 2024
@felipecrv
Copy link
Contributor

@zanmato1984 I'm not doing code reviews with the same frequency lately, but I might take a look at this one during the week.

cpp/src/arrow/compute/api_vector.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/api_vector.h Outdated Show resolved Hide resolved
///
/// For indices[i] = x, reverse_indices[x] = i. And reverse_indices[x] = null if x does
/// not appear in the input indices. For indices[i] = x where x < 0 or x >= output_length,
/// it is ignored. If multiple indices point to the same value, the last one is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this explanation is confusing, but we can work on this later.

cpp/src/arrow/compute/api_vector.h Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Oct 31, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 31, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 4, 2024
@zanmato1984
Copy link
Contributor Author

Hi @felipecrv , the renaming is done. Would you like to proceed with the review? Thank you.

@zanmato1984 zanmato1984 changed the title GH-44393: [C++][Compute] Placement vector functions GH-44393: [C++][Compute] Swizzle vector functions Nov 7, 2024
Comment on lines +766 to +767
/// in the input indices. For indices[i] = x where x < 0 or x > max_index, values[i]
/// is ignored. If multiple indices point to the same value, the last one is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// in the input indices. For indices[i] = x where x < 0 or x > max_index, values[i]
/// is ignored. If multiple indices point to the same value, the last one is used.
/// in the input indices. For indices[i] = x where x < 0 or x > max_index, values[i]
/// is ignored. If multiple indices point to the same value, the last one is used.

Can you explain the point of the lenient behavior wrt. negative indices and the max_index option?
Is there a use case that this enables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No there isn't a particular case in my mind that index < 0 or index > max_index can be very useful, so we can alternatively throw exceptions if we see such values, or even not detect them at all.

(And just in case you are curious about the option max_index itself, it is essential for cases that the input indices are "sparse", please see my other comment #44394 (comment) for details of such usage.)

Thank you.

ArrayKernelExec GenerateInteger(detail::GetTypeId get_id) {
template <template <typename...> class Generator, typename Type0,
typename KernelType = ArrayKernelExec, typename... Args>
KernelType GenerateInteger(detail::GetTypeId get_id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this change is for generate exec_chunked?

Copy link
Contributor Author

@zanmato1984 zanmato1984 Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Just slightly extending it like GenerateNumeric.

/// the same type as the input indices, otherwise must be integer types. An invalid
/// error will be reported if this type is not able to store the length of the input
/// indices.
std::shared_ptr<DataType> output_type = NULLPTR;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should nullable being considered here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be decided during the actual computation. If there are "holes" in the output, validity buffer will be allocated and filled on demand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants