GH-36905: [C++] Add support for SparseUnion to selection functions #36906

Merged 6 commits on Aug 10, 2023


js8544
Collaborator

@js8544 js8544 commented Jul 27, 2023

Rationale for this change

Dense unions are already supported in Take, Filter and DropNull but sparse ones are not.

What changes are included in this PR?

Add kernels for sparse unions to those functions.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions

⚠️ GitHub issue #36905 has been automatically assigned in GitHub to PR creator.

@pitrou
Member

pitrou commented Aug 9, 2023

It seems that it reuses the DenseUnion approach, but it would be more efficient to reuse the Struct approach. What do you think?

@js8544
Collaborator Author

js8544 commented Aug 9, 2023

> It seems that it reuses the DenseUnion approach, but it would be more efficient to reuse the Struct approach. What do you think?

Right, I've changed it to the struct approach. But there is room for improvement for SparseUnion: the unselected children can have any value, so we don't have to call take with the same indices for every child. I've left a TODO comment in the code for this.
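
For readers following along, the struct approach mentioned above can be sketched roughly as follows. This is a toy model built on `std::vector`, not Arrow's kernel code: `ToySparseUnion`, `TakeValues`, and `TakeSparseUnion` are hypothetical names, and validity bitmaps and null indices are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a sparse union with two children (hypothetical type).
// In a sparse union every child has the same length as the union itself;
// type_codes[i] says which child holds the logical value of slot i.
struct ToySparseUnion {
  std::vector<int8_t> type_codes;
  std::vector<int32_t> ints;         // child for type code 0
  std::vector<std::string> strings;  // child for type code 1
};

// Gather values[indices[i]] for every i.
template <typename T>
std::vector<T> TakeValues(const std::vector<T>& values,
                          const std::vector<std::size_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (std::size_t idx : indices) {
    assert(idx < values.size());
    out.push_back(values[idx]);
  }
  return out;
}

// "Struct approach": apply the same indices to the type codes and to every
// child, exactly as one would for the fields of a struct.
ToySparseUnion TakeSparseUnion(const ToySparseUnion& input,
                               const std::vector<std::size_t>& indices) {
  return ToySparseUnion{TakeValues(input.type_codes, indices),
                        TakeValues(input.ints, indices),
                        TakeValues(input.strings, indices)};
}
```

Because every child of a sparse union has the same length as the union, one set of indices serves the type codes and all children, which is what makes the struct-style code path applicable.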

@pitrou
Member

pitrou commented Aug 9, 2023

> But there is room for improvement for SparseUnion: the unselected children can have any value, so we don't have to call take with the same indices for every child.

We don't, but would it improve anything to use different indices for each child?

Comment on lines 764 to 765
int8_t child_id = typed_values.child_id(index);
child_id_buffer_builder_.UnsafeAppend(type_codes_[child_id]);
Member

This seems to be doing a pointless back-and-forth between type codes and child ids?

Suggested change
- int8_t child_id = typed_values.child_id(index);
- child_id_buffer_builder_.UnsafeAppend(type_codes_[child_id]);
+ child_id_buffer_builder_.UnsafeAppend(typed_values.type_code(index));

Collaborator Author

Fixed
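
As a rough illustration of why the round trip is redundant, assume a union whose child `k` carries type code `type_codes[k]`, with `child_ids` as the inverse lookup. The helper names below are hypothetical, and the sketch uses plain vectors rather than Arrow's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the inverse lookup (type code -> child index); -1 marks unused codes.
// Assumes type codes are non-negative int8_t values, i.e. in [0, 127].
std::vector<int> BuildChildIds(const std::vector<int8_t>& type_codes) {
  std::vector<int> child_ids(128, -1);
  for (std::size_t k = 0; k < type_codes.size(); ++k) {
    assert(type_codes[k] >= 0);
    child_ids[static_cast<std::size_t>(type_codes[k])] = static_cast<int>(k);
  }
  return child_ids;
}

// The round trip flagged in the review: type code -> child id -> type code.
// For any code that belongs to the union this returns the code unchanged.
int8_t RoundTrip(const std::vector<int8_t>& type_codes,
                 const std::vector<int>& child_ids, int8_t code) {
  assert(code >= 0);
  int child_id = child_ids[static_cast<std::size_t>(code)];
  assert(child_id >= 0);
  return type_codes[static_cast<std::size_t>(child_id)];
}
```

Since the two lookups are inverses of each other, appending the type code directly gives the same result with one fewer indirection per element.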

@@ -863,6 +920,22 @@ Status DenseUnionFilterExec(KernelContext* ctx, const ExecSpan& batch, ExecResul
return FilterExec<DenseUnionSelectionImpl>(ctx, batch, out);
}

Status SparseUnionFilterExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
Member

Nit: move this into vector_filter_internal.cc alongside StructFilterExec? (The two can probably also share some code...)

Collaborator Author

Moved and extracted a FilterWithTakeExec.
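
The idea behind sharing code between the two filter kernels can be sketched as: turn the boolean mask into indices, then reuse take. This is a simplified model, not the extracted Arrow helper itself; `MaskToIndices`, `Take`, and `FilterWithTake` are hypothetical names, and null handling in the mask is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert a boolean selection mask into the positions of its true entries.
std::vector<std::size_t> MaskToIndices(const std::vector<bool>& mask) {
  std::vector<std::size_t> indices;
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (mask[i]) indices.push_back(i);
  }
  return indices;
}

// Gather values[indices[i]] for every i.
template <typename T>
std::vector<T> Take(const std::vector<T>& values,
                    const std::vector<std::size_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (std::size_t idx : indices) {
    assert(idx < values.size());
    out.push_back(values[idx]);
  }
  return out;
}

// Filter expressed in terms of take: any type that already has a take
// implementation gets a filter implementation for free.
template <typename T>
std::vector<T> FilterWithTake(const std::vector<T>& values,
                              const std::vector<bool>& mask) {
  assert(values.size() == mask.size());
  return Take(values, MaskToIndices(mask));
}
```

The trade-off is an intermediate indices vector in exchange for not maintaining a second, type-specific selection loop.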

.Value(&indices));

Datum result;
RETURN_NOT_OK(Take(batch[0].array.ToArrayData(), Datum(indices),
Member

Perhaps we can call SparseUnionTakeExec directly instead of going through the function lookup and execution machinery again?

Collaborator Author

Done

+---------------+--------+--------------+--------------+--------------+-------------------------+-----------+
- | take | Binary | Any | Integer | Input type 1 | :struct:`TakeOptions` | \(1) \(4) |
+ | take | Binary | Any | Integer | Input type 1 | :struct:`TakeOptions` | \(4) |
Member

I suppose this should be

Suggested change
- | take | Binary | Any | Integer | Input type 1 | :struct:`TakeOptions` | \(4) |
+ | take | Binary | Any | Integer | Input type 1 | :struct:`TakeOptions` | \(3) |

Collaborator Author

Fixed, and the previous line was also wrong.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Aug 9, 2023
@js8544
Collaborator Author

js8544 commented Aug 9, 2023

> We don't, but would it improve anything to use different indices for each child?

Judging from https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc#L369, we can make the unneeded indices the same as the needed ones, so that accessing values_data[indices_data[position]] is more cache-friendly. But I agree that this is very subtle.

@pitrou
Member

pitrou commented Aug 9, 2023

Yes, this is quite subtle. Generating the indices arrays would cost much more, so I'm not sure it would be beneficial in the end.

@js8544
Collaborator Author

js8544 commented Aug 9, 2023

> Yes, this is quite subtle. Generating the indices arrays would cost much more, so I'm not sure it would be beneficial in the end.

OK, I'll remove that TODO comment.

@js8544 js8544 requested a review from pitrou August 9, 2023 16:01
- avoid copying type codes array
- remove skipping of sliced tests on dense unions
- add union-take test with null indices
Member

@pitrou pitrou left a comment

+1, thanks a lot for this @js8544

@pitrou pitrou merged commit ebcf7bc into apache:main Aug 10, 2023
33 of 35 checks passed
@pitrou pitrou removed the "awaiting committer review" label Aug 10, 2023
@js8544
Collaborator Author

js8544 commented Aug 10, 2023

Thanks for the improvements!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit ebcf7bc.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…ons (apache#36906)

* Closes: apache#36905

Lead-authored-by: Jin Shang
Co-authored-by: Antoine Pitrou
Signed-off-by: Antoine Pitrou