-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12724: [C++] Add documentation for authoring compute kernels #10296
ARROW-12724: [C++] Add documentation for authoring compute kernels #10296
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format?
or
See also: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This document will need to be referenced from a higher-level document in order to be available in the doc navigation.
|
||
An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html. | ||
|
||
The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you wrap long lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it seems you're using Markdown syntax. These documents should be authored using restructuredText.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this doesn't have a clear flow. It front-loads definitions and lists of attributes then proceeds to a set of code snippets. This might be useful as a reference document, but I think it'd be more readable to structure this as a how-to guide/tutorial.
I think the tutorial should start with a statement of intent; and explanation in simple terms of what's missing and what will be accomplished by the end of the document. Technical details can then be introduced when (iff) they are necessary to current narrative step.
For example, you could start with browsing the documentation for useful compute functions and displaying how they are used from python. After noting that absolute_value is absent, you could show how to (aspirationally) add it to compute.rst
to make it clear where we'll be at the end of the tutorial.
The discussion of different function kinds then focuses naturally on absolute_value's kind: SCALAR (the most common kind) with the specific examples of absolute_value and existing functions referenced in the first paragraph to illustrate what this implies about the function's execution.
In summary, I think exhaustive coverage of the complexity of the compute module is less useful than a targeted tour of a common modification.
|
||
A function that computes scalar summary statistics from array input. | ||
|
||
### Hash aggregate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### Hash aggregate | |
Hash aggregate | |
~~~~~~~~~~~~~~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, please ensure that all your links are in the RST format. For example, to create a link to the doxygen doc for a specific class member, use:
:member:`ScalarKernel::exec`
To create a link to a specific source file on the master
branch, use:
`The scalar API header <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h>`__
Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix). | ||
|
||
```C++ | ||
SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, please avoid use of macros like SCALAR_ARITHMETIC_UNARY()
and other context specific helpers for this walk through, as they obscure the underlying C++ and require a user to look up the macro's definition to understand what they're writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was helpful for me. Perhaps it would be better to explain when such macros should be used and explain this particular macro so that these are used consistently within Arrow.
Define compute function | ||
----------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to avoid ambiguity between the instance of compute::Function
which is being added to the registry and the convenience function which is being added to api_scalar.h
. The former is the canonical definition. The latter is a wrapper for easy use from C++, and probably isn't necessary for most compute functions. In fact for simplicity it might be better to avoid modifying api_scalar.h
in this tutorial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be helpful to say something about impacts on language bindings, assuming that most of these will use what is in the registry rather tan the C++ convenience function.
Arrow uses Google test framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid tests. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories. | ||
|
||
`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a simple (un-templated, non-suite) C++ test case is a necessary code snippet. For example,
// scalar_arithmetic_test.cc
TEST(AbsoluteValue, IntegralInputs) {
for (auto type : {int8(), int16(), int32(), int64()}) {
CheckScalarUnary("absolute_value", int8(), "[]", int8(), "[]");
CheckScalarUnary("absolute_value", int8(), "[0, -1, 1, 2, -2, 3, -3]", int8(),
"[0, 1, 1, 2, 2, 3, 3]");
}
}
Note that the above compiles and executes before adding the absolute_value function (IMHO it's a useful clarification of intended behavior to add a failing test like this as early as possible in the addition of a new function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also add untyped test fixtures (TEST_F
s)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May also want to link to the primer documentation, as the different types of tests are a source for confusion, as indicated in https://stackoverflow.com/questions/58600728/what-is-the-difference-between-test-test-f-and-test-p
Co-authored-by: Benjamin Kietzman <[email protected]>
This is a sort of confusion I had when I first started writing kernels. |
* Containment tests | ||
* Structural transforms | ||
* Conversions | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add the comment by @pitrou in Zulip here.
Simple questions for whether a function is a scalar function:
- Do all inputs have the same (broadcasted) length?
- Does the Nth element in the output only depend on the Nth element of each input?
* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point). | ||
* For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, kernel returns zero and sets that `Status*` error flag. | ||
* For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should discuss the guarantees provided by the compute
infrastructure as well.
ex: for scalar functions, if multiple arrays are passed, the compute infrastructure checks for nullity, guarantees that they are of same size, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be helpful.
`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`. | ||
|
||
Helpers | ||
======= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be nicer if we could discuss about some helper methods in codegen_internal.h
.
Arrow uses Google test framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid tests. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories. | ||
|
||
`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also add untyped test fixtures (TEST_F
s)?
One further consideration is interface design. This seems like it is still being stabilized but guiding principles would be helpful. |
|
||
The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc. | ||
|
||
Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values. | |
Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time with output length the same as the input length. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values. |
Maybe this is clearer
Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values. | ||
|
||
Terminology: | ||
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options. | |
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of input data types or function behavior options. |
Compute Functions | ||
================= | ||
|
||
An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html. | |
An introduction to compute functions is provided in `compute documentation <https://arrow.apache.org/docs/cpp/compute.html>`_ . |
|
||
An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html. | ||
|
||
The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc. | |
The `compute submodule <https://https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute>`_ contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc. |
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected. | ||
* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel. | ||
|
||
[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes: | |
`Compute functions <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h>`_ have the following principal attributes: |
* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel. | ||
|
||
[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes: | ||
* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings | |
* A unique :doc:`name <.compute>` used for function invocation and language bindings |
Adding a section heading to the approriate part of the compute documentation will give a better link. See restructured text documentation
which indicates in what context it is valid for use | ||
* Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes) | ||
* Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`. | ||
* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures. | |
* Compute functions (see `FunctionImpl and subclasses <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h>`_) contain `"kernels" <https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels>`_ which are implementations for specific argument signatures. |
~~~~~~ | ||
|
||
A function that performs scalar data operations on whole arrays of | ||
data. Can generally process Array or Scalar values. The size of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data. Can generally process Array or Scalar values. The size of the | |
data. Can generally process array or scalar values. The size of the |
Do not expect this needs to be capitalized. This is a very helpful explanation.
A function that performs scalar data operations on whole arrays of | ||
data. Can generally process Array or Scalar values. The size of the | ||
output will be the same as the size (or broadcasted size, in the case | ||
of mixing Array and Scalar inputs) of the input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of mixing Array and Scalar inputs) of the input. | |
of mixing array and scalar inputs) of the input. |
|
||
https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions | ||
|
||
**Categories of Scalar functions** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**Categories of Scalar functions** | |
**Categories of Scalar Functions** |
* predicates | ||
* transforms | ||
* trimming | ||
* splitting | ||
* extraction | ||
* Containment tests | ||
* Structural transforms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* predicates | |
* transforms | |
* trimming | |
* splitting | |
* extraction | |
* Containment tests | |
* Structural transforms | |
* Predicates | |
* Transforms | |
* Trimming | |
* Splitting | |
* Extraction | |
* Containment Tests | |
* Structural Transforms |
More consistent capitalization
A function with array input and output whose behavior depends on the | ||
values of the entire arrays passed, rather than the value of each scalar value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A function with array input and output whose behavior depends on the | |
values of the entire arrays passed, rather than the value of each scalar value. | |
A function with array input and output whose behavior depends on combinations of | |
values at different locations in the input arrays, rather than the independent computations | |
on scalar values at the same location in the input arrays. |
**Categories of Vector functions** | ||
|
||
* Associative transforms | ||
* Selections | ||
* Sorts and partitions | ||
* Structural transforms | ||
|
||
|
||
Scalar aggregate | ||
~~~~~~~~~~~~~~~~ | ||
|
||
A function that computes scalar summary statistics from array input. | ||
|
||
### Hash aggregate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**Categories of Vector functions** | |
* Associative transforms | |
* Selections | |
* Sorts and partitions | |
* Structural transforms | |
Scalar aggregate | |
~~~~~~~~~~~~~~~~ | |
A function that computes scalar summary statistics from array input. | |
### Hash aggregate | |
**Categories of Vector Functions** | |
* Associative Transforms | |
* Selections | |
* Sorts and Partitions | |
* Structural Transforms | |
Scalar Aggregate | |
~~~~~~~~~~~~~~~~ | |
A function that computes scalar summary statistics from array input. | |
### Hash Aggregate |
Consistent capitalization
Kernels | ||
------- | ||
|
||
Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation. | |
Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation. |
|
||
Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation. | ||
|
||
Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template. | |
Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernel methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template. |
Maybe a link to an online guide to templating in C++ as used in Arrow would be helpful.
* arrow/util/int_util_internal.h - defines utility functions | ||
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* arrow/util/int_util_internal.h - defines utility functions | |
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library. | |
* `arrow/util/int_util_internal.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/util/int_util_internal.h>`_ - defines utility functions | |
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in the `"portable_snippets" <https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h>`_ library. |
Links to the files are helpful.
* arrow/util/int_util_internal.h - defines utility functions | ||
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library. | ||
|
||
* compute/api_scalar.h - contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* compute/api_scalar.h - contains | |
* `arrow/compute/api_scalar.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/api_scalar.h>`_ - contains |
Consistent abbreviated path names make readability easier.
* compute/api_scalar.h - contains | ||
* Subclasses of `FunctionOptions` for specific categories of compute functions | ||
* API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function. | ||
* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode. | |
* `arrow/compute/api_scalar.cc <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/api_scalar.cc>`_ - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode. |
* API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function. | ||
* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode. | ||
* Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name. | ||
* Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant). | |
* Macros of the form `SCALAR_*` invoke `CallFunction` require two function names, default which behaves like `SCALAR_EAGER_*`, and `_checked` variant which checks for overflow in the calculations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Underflow may also be problematic.
* Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name. | ||
* Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant). | ||
|
||
* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point). | |
* `arrow/compute/kernels/scalar_arithmetic.cc <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc>`_ - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point). |
Kernel dispatcher | ||
----------------- | ||
|
||
* compute/exec.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* compute/exec.h | |
* `arrow/compute/exec.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/exec.hc>`_ |
Register kernels of compute function | ||
------------------------------------ | ||
|
||
1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A description of NotNull and other variants may be helpful, though this could also go in the main user documentation since it has performance and correctness implications.
|
||
1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case. | ||
|
||
1. Create the kernels by invoking the kernel generators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is vague. Add a link to an example kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe indicate that example code links are provided in a later section.
|
||
1. Create the kernels by invoking the kernel generators. | ||
|
||
1. Register the kernels in the corresponding registry along with its `FunctionDoc`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A link would be helpful for this as well. Perhaps choose one kernel and use it to illustrate the points that are made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe indicate that example code links are provided in a later section.
1. Input/output types: Numerical (signed and unsigned, integral and floating-point) | ||
1. Input/output shapes: operate on scalars or element-wise for arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Input/output types: Numerical (signed and unsigned, integral and floating-point) | |
1. Input/output shapes: operate on scalars or element-wise for arrays | |
1. Input types: Numerical (signed and unsigned, integral and floating-point) | |
1. Input shapes: Operate on scalars or element-wise for arrays | |
1. Output types: Numerical, same as input | |
1. Output shapes: Same as input shape |
The documentation does not combine Input and Output,
which improves clarity because dependence of output on input can be made clear.
```C++ | ||
struct AbsoluteValue { | ||
template <typename T, typename Arg> | ||
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) { | ||
return (arg < static_cast<T>(0)) ? -arg : arg; | ||
} | ||
|
||
template <typename T, typename Arg> | ||
static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) { | ||
return arg; | ||
} | ||
|
||
template <typename T, typename Arg> | ||
static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) { | ||
return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg; | ||
} | ||
}; | ||
|
||
struct AbsoluteValueChecked { | ||
template <typename T, typename Arg> | ||
static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) { | ||
static_assert(std::is_same<T, Arg>::value, ""); | ||
if (arg < static_cast<T>(0)) { | ||
T result = 0; | ||
if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) { | ||
*st = Status::Invalid("overflow"); | ||
} | ||
return result; | ||
} | ||
return arg; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```C++ | |
struct AbsoluteValue { | |
template <typename T, typename Arg> | |
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) { | |
return (arg < static_cast<T>(0)) ? -arg : arg; | |
} | |
template <typename T, typename Arg> | |
static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) { | |
return arg; | |
} | |
template <typename T, typename Arg> | |
static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) { | |
return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg; | |
} | |
}; | |
struct AbsoluteValueChecked { | |
template <typename T, typename Arg> | |
static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) { | |
static_assert(std::is_same<T, Arg>::value, ""); | |
if (arg < static_cast<T>(0)) { | |
T result = 0; | |
if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) { | |
*st = Status::Invalid("overflow"); | |
} | |
return result; | |
} | |
return arg; | |
} | |
.. code-block:: cpp | |
struct AbsoluteValue { | |
template <typename T, typename Arg> | |
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) { | |
return (arg < static_cast<T>(0)) ? -arg : arg; | |
} | |
template <typename T, typename Arg> | |
static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) { | |
return arg; | |
} | |
template <typename T, typename Arg> | |
static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) { | |
return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg; | |
} | |
}; | |
struct AbsoluteValueChecked { | |
template <typename T, typename Arg> | |
static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) { | |
static_assert(std::is_same<T, Arg>::value, ""); | |
if (arg < static_cast<T>(0)) { | |
T result = 0; | |
if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) { | |
*st = Status::Invalid("overflow"); | |
} | |
return result; | |
} | |
return arg; | |
} | |
template <typename T, typename Arg> | ||
static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) { | ||
static_assert(std::is_same<T, Arg>::value, ""); | ||
return arg; | ||
} | ||
|
||
template <typename T, typename Arg> | ||
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) { | ||
static_assert(std::is_same<T, Arg>::value, ""); | ||
return (arg < static_cast<T>(0)) ? -arg : arg; | ||
} | ||
}; | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
template <typename T, typename Arg> | |
static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) { | |
static_assert(std::is_same<T, Arg>::value, ""); | |
return arg; | |
} | |
template <typename T, typename Arg> | |
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) { | |
static_assert(std::is_same<T, Arg>::value, ""); | |
return (arg < static_cast<T>(0)) ? -arg : arg; | |
} | |
}; | |
``` | |
template <typename T, typename Arg> | |
static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) { | |
static_assert(std::is_same<T, Arg>::value, ""); | |
return arg; | |
} | |
template <typename T, typename Arg> | |
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) { | |
static_assert(std::is_same<T, Arg>::value, ""); | |
return (arg < static_cast<T>(0)) ? -arg : arg; | |
} | |
}; | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing formatting to a restructured text code block.
```C++ | ||
const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise", | ||
("Results will wrap around on integer overflow.\n" | ||
"Use function \"absolute_value_checked\" if you want overflow\n" | ||
"to return an error."), | ||
{"x"}}; | ||
|
||
const FunctionDoc absolute_value_checked_doc{ | ||
"Calculate the absolute value of the argument element-wise", | ||
("This function returns an error on overflow. For a variant that\n" | ||
"doesn't fail on overflow, use function \"absolute_value_checked\"."), | ||
{"x"}}; | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```C++ | |
const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise", | |
("Results will wrap around on integer overflow.\n" | |
"Use function \"absolute_value_checked\" if you want overflow\n" | |
"to return an error."), | |
{"x"}}; | |
const FunctionDoc absolute_value_checked_doc{ | |
"Calculate the absolute value of the argument element-wise", | |
("This function returns an error on overflow. For a variant that\n" | |
"doesn't fail on overflow, use function \"absolute_value_checked\"."), | |
{"x"}}; | |
``` | |
.. code-block:: cpp | |
const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise", | |
("Results will wrap around on integer overflow.\n" | |
"Use function \"absolute_value_checked\" if you want overflow\n" | |
"to return an error."), | |
{"x"}}; | |
const FunctionDoc absolute_value_checked_doc{ | |
"Calculate the absolute value of the argument element-wise", | |
("This function returns an error on overflow. For a variant that\n" | |
"doesn't fail on overflow, use function \"absolute_value_checked\"."), | |
{"x"}}; | |
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use restructured text code block
```C++ | ||
auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc); | ||
auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc); | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```C++ | |
auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc); | |
auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc); | |
``` | |
.. code-block:: cpp | |
auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc); | |
auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc); | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use restructured text code block
```C++ | ||
DCHECK_OK(registry->AddFunction(std::move(absolute_value))); | ||
DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked))); | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```C++ | |
DCHECK_OK(registry->AddFunction(std::move(absolute_value))); | |
DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked))); | |
``` | |
.. code-block:: cpp | |
DCHECK_OK(registry->AddFunction(std::move(absolute_value))); | |
DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked))); | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use restructured text code block
```C++ | ||
ARROW_EXPORT | ||
Result<Datum> Hypotenuse(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR); | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```C++ | |
ARROW_EXPORT | |
Result<Datum> Hypotenuse(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR); | |
``` | |
.. code-block:: cpp | |
ARROW_EXPORT | |
Result<Datum> Hypotenuse(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), | |
ExecContext* ctx = NULLPTR); | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use restructured text code block
```C++ | ||
SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked") | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```C++ | |
SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked") | |
``` | |
.. code-block:: cpp | |
SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked") | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use restructured text code block
@bkmgit Thanks for your reviews! I will get back to this PR and resolve them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an image showing what goes where could be also a quick way to protray the hierarchy and how calls are made when a function is invoked.
|
||
Terminology: | ||
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options. | ||
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected. | |
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must ensure a kernel corresponding to the value types of the inputs is selected. |
or
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected. | |
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must select a kernel corresponding to the value types of the inputs. |
?
[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes: | ||
* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings | ||
* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE) | ||
which indicates in what context it is valid for use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which indicates in what context it is valid for use | |
indicates in what context it is valid for use |
* Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes) | ||
* Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`. | ||
* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures. | ||
* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments | |
* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) states the number of required arguments |
Files and structures of the computer layer | ||
========================================== | ||
|
||
This section describes the general structure of files/directory and principal code structures of the compute layer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section describes the general structure of files/directory and principal code structures of the compute layer. | |
This section describes the general structure of files/directory and principal code structures of the compute layer using scalar function <pick something>. |
followed by bullet points on how this function exists in code base. The raw information is present even now in the current list of points, but it feels like a flow of information is missing.
@edponce Are you planning to work on this in the future? |
@pitrou Wow, I do not know why I extended this PR so much. I will take tomorrow morning to resolve all the issues and present a full draft. |
Linking as reference this general outline for adding compute functions. |
I plan on subsuming this PR into a new one (#13933). I will try to make sure all of these reviews are accommodated. |
This PR extends to the compute layer documentation by describing a developer's process for authoring new compute functions. It describes the commonly used files, data structures, and functions for understanding compute functions. Also, it provides a tutorial with examples.