Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12724: [C++] Add documentation for authoring compute kernels #10296

Conversation

edponce
Copy link
Contributor

@edponce edponce commented May 11, 2021

This PR extends to the compute layer documentation by describing a developer's process for authoring new compute functions. It describes the commonly used files, data structures, and functions for understanding compute functions. Also, it provides a tutorial with examples.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@edponce edponce changed the title [C++] Add documentation for authoring compute kernels ARROW-12724: [C++] Add documentation for authoring compute kernels May 11, 2021
@github-actions
Copy link

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This document will need to be referenced from a higher-level document in order to be available in the doc navigation.


An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.

The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you wrap long lines?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems you're using Markdown syntax. These documents should be authored using restructuredText.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this doesn't have a clear flow. It front-loads definitions and lists of attributes then proceeds to a set of code snippets. This might be useful as a reference document, but I think it'd be more readable to structure this as a how-to guide/tutorial.

I think the tutorial should start with a statement of intent; and explanation in simple terms of what's missing and what will be accomplished by the end of the document. Technical details can then be introduced when (iff) they are necessary to current narrative step.

For example, you could start with browsing the documentation for useful compute functions and displaying how they are used from python. After noting that absolute_value is absent, you could show how to (aspirationally) add it to compute.rst to make it clear where we'll be at the end of the tutorial.

The discussion of different function kinds then focuses naturally on absolute_value's kind: SCALAR (the most common kind) with the specific examples of absolute_value and existing functions referenced in the first paragraph to illustrate what this implies about the function's execution.

In summary, I think exhaustive coverage of the complexity of the compute module is less useful than a targeted tour of a common modification.

docs/source/cpp/authoring_compute_functions.rst Outdated Show resolved Hide resolved

A function that computes scalar summary statistics from array input.

### Hash aggregate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Hash aggregate
Hash aggregate
~~~~~~~~~~~~~~

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, please ensure that all your links are in the RST format. For example, to create a link to the doxygen doc for a specific class member, use:

 :member:`ScalarKernel::exec`

To create a link to a specific source file on the master branch, use:

`The scalar API header <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h>`__

Recall that "Arithmetic" functions create two kernel variants: default and overflow-checking. Therefore, we use the `SCALAR_ARITHMETIC_UNARY` macro which requires two function names (with and without "_checked" suffix).

```C++
SCALAR_ARITHMETIC_UNARY(AbsoluteValue, "absolute_value", "absolute_value_checked")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, please avoid use of macros like SCALAR_ARITHMETIC_UNARY() and other context specific helpers for this walk through, as they obscure the underlying C++ and require a user to look up the macro's definition to understand what they're writing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was helpful for me. Perhaps it would be better to explain when such macros should be used and explain this particular macro so that these are used consistently within Arrow.

Comment on lines +265 to +266
Define compute function
-----------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to avoid ambiguity between the instance of compute::Function which is being added to the registry and the convenience function which is being added to api_scalar.h. The former is the canonical definition. The latter is a wrapper for easy use from C++, and probably isn't necessary for most compute functions. In fact for simplicity it might be better to avoid modifying api_scalar.h in this tutorial

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be helpful to say something about impacts on language bindings, assuming that most of these will use what is in the registry rather tan the C++ convenience function.

Arrow uses Google test framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid tests. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.

`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a simple (un-templated, non-suite) C++ test case is a necessary code snippet. For example,

// scalar_arithmetic_test.cc
TEST(AbsoluteValue, IntegralInputs) {
  for (auto type : {int8(), int16(), int32(), int64()}) {
    CheckScalarUnary("absolute_value", int8(), "[]", int8(), "[]");

    CheckScalarUnary("absolute_value", int8(), "[0, -1, 1, 2, -2, 3, -3]", int8(),
                     "[0, 1, 1, 2, 2, 3, 3]");
  }
}

Note that the above compiles and executes before adding the absolute_value function (IMHO it's a useful clarification of intended behavior to add a failing test like this as early as possible in the addition of a new function).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also add untyped test fixtures (TEST_Fs)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May also want to link to the primer documentation, as the different types of tests are a source for confusion, as indicated in https://stackoverflow.com/questions/58600728/what-is-the-difference-between-test-test-f-and-test-p

Co-authored-by: Benjamin Kietzman <[email protected]>
@nirandaperera
Copy link
Contributor

nirandaperera commented Jun 3, 2021

This is a sort of confusion I had when I first started writing kernels.
"A 'scalar' is a single (non-array) element! But how come "Scalar functions" accept and produce arrays?"
But now I understand, even though arrays are passed, the function is applied on each scalar in the array independently.
Do you this is something we'd want to explicitly discuss in the doc?
May be use an alternative jargon like, "element-wise and vector functions"?

* Containment tests
* Structural transforms
* Conversions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add the comment by @pitrou in Zulip here.

Simple questions for whether a function is a scalar function:
- Do all inputs have the same (broadcasted) length?
- Does the Nth element in the output only depend on the Nth element of each input?

* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
* For compute functions that may trigger overflow the "checked" variant is a class suffixed with `Checked` and makes use of assertions and overflow checks. If overflow occurs, kernel returns zero and sets that `Status*` error flag.
* For compute functions that do not have a valid mathematical operation for specific datatypes (e.g., negate an unsigned integer), the kernel for those types is provided but should trigger an error with `DCHECK(false) << This is included only for the purposes of instantiability from the "arithmetic kernel generator"` and return zero.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should discuss the guarantees provided by the compute infrastructure as well.
ex: for scalar functions, if multiple arrays are passed, the compute infrastructure checks for nullity, guarantees that they are of same size, etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be helpful.

`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.

Helpers
=======
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nicer if we could discuss about some helper methods in codegen_internal.h.

Arrow uses Google test framework. All kernels should have tests to ensure stability of the compute layer. Tests should at least cover ordinary inputs, corner cases, extreme values, nulls, different data types, and invalid tests. Moreover, there can be kernel-specific tests. For example, for arithmetic kernels, tests should include `NaN` and `Inf` inputs. The test files are located alongside the kernel source files and suffixed with `_test`. Tests are grouped by compute function `kind` and categories.

`TYPED_TEST(test suite name, compute function)` - wrapper to define tests for the given compute function. The `test suite name` is associated with a set of data types that are used for the test suite (`TYPED_TEST_SUITE`). Tests from multiple compute functions can be placed in the same test suite. For example, `TYPED_TEST(TestBinaryArithmeticFloating, Sub)` and `TYPED_TEST(TestBinaryArithmeticFloating, Mul)`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also add untyped test fixtures (TEST_Fs)?

@bkmgit
Copy link
Contributor

bkmgit commented Dec 29, 2021

One further consideration is interface design. This seems like it is still being stabilized but guiding principles would be helpful.


The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.

Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.
Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time with output length the same as the input length. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.

Maybe this is clearer

Many functions have SQL-like semantics in that they perform element-wise or scalar operations on whole arrays at a time. Other functions are not SQL-like and compute results that may be a different length or whose results depend on the order of the values.

Terminology:
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of input data types or function behavior options.

Compute Functions
=================

An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.
An introduction to compute functions is provided in `compute documentation <https://arrow.apache.org/docs/cpp/compute.html>`_ .


An introduction to compute functions is provided in https://arrow.apache.org/docs/cpp/compute.html.

The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The [compute submodule](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute) contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.
The `compute submodule <https://https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute>`_ contains analytical functions that process primarily columnar data for either scalar or Arrow-based array inputs. These are intended for use inside query engines, data frame libraries, etc.

* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.

[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
`Compute functions <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h>`_ have the following principal attributes:

* Functions along with their kernel implementations are collected in a "function registry". Given a function name and argument types, we can look up that function and dispatch to a compatible kernel.

[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
* A unique :doc:`name <.compute>` used for function invocation and language bindings

Adding a section heading to the approriate part of the compute documentation will give a better link. See restructured text documentation

which indicates in what context it is valid for use
* Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
* Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
* Compute functions (see `FunctionImpl and subclasses <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h>`_) contain `"kernels" <https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels>`_ which are implementations for specific argument signatures.

~~~~~~

A function that performs scalar data operations on whole arrays of
data. Can generally process Array or Scalar values. The size of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data. Can generally process Array or Scalar values. The size of the
data. Can generally process array or scalar values. The size of the

Do not expect this needs to be capitalized. This is a very helpful explanation.

A function that performs scalar data operations on whole arrays of
data. Can generally process Array or Scalar values. The size of the
output will be the same as the size (or broadcasted size, in the case
of mixing Array and Scalar inputs) of the input.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
of mixing Array and Scalar inputs) of the input.
of mixing array and scalar inputs) of the input.


https://arrow.apache.org/docs/cpp/compute.html#arithmetic-functions

**Categories of Scalar functions**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**Categories of Scalar functions**
**Categories of Scalar Functions**

Comment on lines +78 to +84
* predicates
* transforms
* trimming
* splitting
* extraction
* Containment tests
* Structural transforms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* predicates
* transforms
* trimming
* splitting
* extraction
* Containment tests
* Structural transforms
* Predicates
* Transforms
* Trimming
* Splitting
* Extraction
* Containment Tests
* Structural Transforms

More consistent capitalization

Comment on lines +91 to +92
A function with array input and output whose behavior depends on the
values of the entire arrays passed, rather than the value of each scalar value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A function with array input and output whose behavior depends on the
values of the entire arrays passed, rather than the value of each scalar value.
A function with array input and output whose behavior depends on combinations of
values at different locations in the input arrays, rather than the independent computations
on scalar values at the same location in the input arrays.

Comment on lines +94 to +107
**Categories of Vector functions**

* Associative transforms
* Selections
* Sorts and partitions
* Structural transforms


Scalar aggregate
~~~~~~~~~~~~~~~~

A function that computes scalar summary statistics from array input.

### Hash aggregate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**Categories of Vector functions**
* Associative transforms
* Selections
* Sorts and partitions
* Structural transforms
Scalar aggregate
~~~~~~~~~~~~~~~~
A function that computes scalar summary statistics from array input.
### Hash aggregate
**Categories of Vector Functions**
* Associative Transforms
* Selections
* Sorts and Partitions
* Structural Transforms
Scalar Aggregate
~~~~~~~~~~~~~~~~
A function that computes scalar summary statistics from array input.
### Hash Aggregate

Consistent capitalization

Kernels
-------

Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.
Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.


Kernels are simple ``structs`` containing only function pointers (the "methods" of the kernel) and attribute flags. Each function kind corresponds to a class of Kernel with methods representing each stage of the function's execution. For example, :struct:`ScalarKernel` includes (optionally) :member:`ScalarKernel::init` to initialize any state necessary for execution and :member:`ScalarKernel::exec` to perform the computation.

Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage c++'s powerful template system to efficiently generate kernels' methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.
Since many kernels are closely related in operation and differ only in their input types, it's frequently useful to leverage C++'s powerful template system to efficiently generate kernel methods. For example, the "add" compute function accepts all numeric types and its kernels' methods are instantiations of the same function template.

Maybe a link to an online guide to templating in C++ as used in Arrow would be helpful.

Comment on lines +143 to +144
* arrow/util/int_util_internal.h - defines utility functions
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
Copy link
Contributor

@bkmgit bkmgit Dec 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* arrow/util/int_util_internal.h - defines utility functions
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.
* `arrow/util/int_util_internal.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/util/int_util_internal.h>`_ - defines utility functions
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in the `"portable_snippets" <https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h>`_ library.

Links to the files are helpful.

* arrow/util/int_util_internal.h - defines utility functions
* Function definitions suffixed with `WithOverflow` to support "safe math" for arithmetic kernels. Helper macros are included to create the definitions which invoke the corresponding operation in [`portable_snippets`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/vendored/portable-snippets/safe-math.h) library.

* compute/api_scalar.h - contains
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* compute/api_scalar.h - contains
* `arrow/compute/api_scalar.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/api_scalar.h>`_ - contains

Consistent abbreviated path names make readability easier.

* compute/api_scalar.h - contains
* Subclasses of `FunctionOptions` for specific categories of compute functions
* API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
* `arrow/compute/api_scalar.cc <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/api_scalar.cc>`_ - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.

* API/prototypes for all `Scalar` compute functions. Note that there is a single API version for each compute function.
* *compute/api_scalar.cc* - defines `Scalar` compute functions as wrappers over ["CallFunction"](https://arrow.apache.org/docs/cpp/api/compute.html?highlight=one%20shot#_CPPv412CallFunctionRKNSt6stringERKNSt6vectorI5DatumEEPK15FunctionOptionsP11ExecContext) (one-shot function). Arrow provides macros to easily define compute functions based on their `arity` and invocation mode.
* Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
* Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
Copy link
Contributor

@bkmgit bkmgit Dec 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).
* Macros of the form `SCALAR_*` invoke `CallFunction` require two function names, default which behaves like `SCALAR_EAGER_*`, and `_checked` variant which checks for overflow in the calculations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underflow may also be problematic.

* Macros of the form `SCALAR_EAGER_*` invoke `CallFunction` directly and only require one function name.
* Macros of the form `SCALAR_*` invoke `CallFunction` after checking for overflow and require two function names (default and `_checked` variant).

* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
Copy link
Contributor

@bkmgit bkmgit Dec 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* compute/kernels/scalar_arithmetic.cc - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).
* `arrow/compute/kernels/scalar_arithmetic.cc <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc>`_ - contains kernel definitions for "Scalar Arithmetic" compute functions. Kernel definitions are defined via a class with literal name of compute function and containing methods named `Call` that are parameterized for specific input types (signed/unsigned integer and floating-point).

Kernel dispatcher
-----------------

* compute/exec.h
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* compute/exec.h
* `arrow/compute/exec.h <https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/exec.hc>`_

Register kernels of compute function
------------------------------------

1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A description of NotNull and other variants may be helpful, though this could also go in the main user documentation since it has performance and correctness implications.


1. Before registering the kernels, check that the available kernel generators support the `arity` and data types allowed for the new compute function. Kernel generators are not of the same form for all the kernel `kinds`. For example, in the "Scalar Arithmetic" kernels, registration functions have names of the form `MakeArithmeticFunction` and `MakeArithmeticFunctionNotNull`. If not available, you will need to define them for your particular case.

1. Create the kernels by invoking the kernel generators.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is vague. Add a link to an example kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe indicate that example code links are provided in a later section.


1. Create the kernels by invoking the kernel generators.

1. Register the kernels in the corresponding registry along with its `FunctionDoc`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A link would be helpful for this as well. Perhaps choose one kernel and use it to illustrate the points that are made.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe indicate that example code links are provided in a later section.

Comment on lines +260 to +261
1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
1. Input/output shapes: operate on scalars or element-wise for arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. Input/output types: Numerical (signed and unsigned, integral and floating-point)
1. Input/output shapes: operate on scalars or element-wise for arrays
1. Input types: Numerical (signed and unsigned, integral and floating-point)
1. Input shapes: Operate on scalars or element-wise for arrays
1. Output types: Numerical, same as input
1. Output shapes: Same as input shape

The documentation does not combine Input and Output,
which improves clarity because dependence of output on input can be made clear.

Comment on lines +295 to +326
```C++
struct AbsoluteValue {
template <typename T, typename Arg>
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
return (arg < static_cast<T>(0)) ? -arg : arg;
}

template <typename T, typename Arg>
static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
return arg;
}

template <typename T, typename Arg>
static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
}
};

struct AbsoluteValueChecked {
template <typename T, typename Arg>
static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
if (arg < static_cast<T>(0)) {
T result = 0;
if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
*st = Status::Invalid("overflow");
}
return result;
}
return arg;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```C++
struct AbsoluteValue {
template <typename T, typename Arg>
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
return (arg < static_cast<T>(0)) ? -arg : arg;
}
template <typename T, typename Arg>
static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
return arg;
}
template <typename T, typename Arg>
static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
}
};
struct AbsoluteValueChecked {
template <typename T, typename Arg>
static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
if (arg < static_cast<T>(0)) {
T result = 0;
if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
*st = Status::Invalid("overflow");
}
return result;
}
return arg;
}
.. code-block:: cpp
struct AbsoluteValue {
template <typename T, typename Arg>
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status*) {
return (arg < static_cast<T>(0)) ? -arg : arg;
}
template <typename T, typename Arg>
static constexpr enable_if_unsigned_integer<T> Call(KernelContext*, Arg arg, Status*) {
return arg;
}
template <typename T, typename Arg>
static constexpr enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
return (arg < static_cast<T>(0)) ? arrow::internal::SafeSignedNegate(arg) : arg;
}
};
struct AbsoluteValueChecked {
template <typename T, typename Arg>
static enable_if_signed_integer<T> Call(KernelContext*, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
if (arg < static_cast<T>(0)) {
T result = 0;
if (ARROW_PREDICT_FALSE(NegateWithOverflow(arg, &result))) {
*st = Status::Invalid("overflow");
}
return result;
}
return arg;
}

Use a restructured text code block

Comment on lines +327 to +339
template <typename T, typename Arg>
static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
return arg;
}

template <typename T, typename Arg>
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
return (arg < static_cast<T>(0)) ? -arg : arg;
}
};
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
template <typename T, typename Arg>
static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
return arg;
}
template <typename T, typename Arg>
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
return (arg < static_cast<T>(0)) ? -arg : arg;
}
};
```
template <typename T, typename Arg>
static enable_if_unsigned_integer<T> Call(KernelContext* ctx, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
return arg;
}
template <typename T, typename Arg>
static constexpr enable_if_floating_point<T> Call(KernelContext*, Arg arg, Status* st) {
static_assert(std::is_same<T, Arg>::value, "");
return (arg < static_cast<T>(0)) ? -arg : arg;
}
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing formatting to a restructured text code block.

Comment on lines +344 to +356
```C++
const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
("Results will wrap around on integer overflow.\n"
"Use function \"absolute_value_checked\" if you want overflow\n"
"to return an error."),
{"x"}};

const FunctionDoc absolute_value_checked_doc{
"Calculate the absolute value of the argument element-wise",
("This function returns an error on overflow. For a variant that\n"
"doesn't fail on overflow, use function \"absolute_value_checked\"."),
{"x"}};
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```C++
const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
("Results will wrap around on integer overflow.\n"
"Use function \"absolute_value_checked\" if you want overflow\n"
"to return an error."),
{"x"}};
const FunctionDoc absolute_value_checked_doc{
"Calculate the absolute value of the argument element-wise",
("This function returns an error on overflow. For a variant that\n"
"doesn't fail on overflow, use function \"absolute_value_checked\"."),
{"x"}};
```
.. code-block:: cpp
const FunctionDoc absolute_value_doc{"Calculate the absolute value of the argument element-wise",
("Results will wrap around on integer overflow.\n"
"Use function \"absolute_value_checked\" if you want overflow\n"
"to return an error."),
{"x"}};
const FunctionDoc absolute_value_checked_doc{
"Calculate the absolute value of the argument element-wise",
("This function returns an error on overflow. For a variant that\n"
"doesn't fail on overflow, use function \"absolute_value_checked\"."),
{"x"}};
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use restructured text code block

Comment on lines +365 to +368
```C++
auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```C++
auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);
```
.. code-block:: cpp
auto absolute_value = MakeUnaryArithmeticFunction<AbsoluteValue>("absolute_value", &absolute_value_doc);
auto absolute_value_checked = MakeUnaryArithmeticFunctionNotNull<AbsoluteValueChecked>("absolute_value_checked", &absolute_value_checked_doc);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use restructured text code block

Comment on lines +371 to +374
```C++
DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```C++
DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));
```
.. code-block:: cpp
DCHECK_OK(registry->AddFunction(std::move(absolute_value)));
DCHECK_OK(registry->AddFunction(std::move(absolute_value_checked)));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use restructured text code block

Comment on lines +407 to +410
```C++
ARROW_EXPORT
Result<Datum> Hypotenuse(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```C++
ARROW_EXPORT
Result<Datum> Hypotenuse(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(), ExecContext* ctx = NULLPTR);
```
.. code-block:: cpp
ARROW_EXPORT
Result<Datum> Hypotenuse(const Datum& arg, ArithmeticOptions options = ArithmeticOptions(),
ExecContext* ctx = NULLPTR);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use restructured text code block

Comment on lines +415 to +417
```C++
SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked")
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```C++
SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked")
```
.. code-block:: cpp
SCALAR_ARITHMETIC_BINARY(Hypotenuse, "hypotenuse", "hypotenuse_checked")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use restructured text code block

@edponce
Copy link
Contributor Author

edponce commented Dec 29, 2021

@bkmgit Thanks for your reviews! I will get back to this PR and resolve them.

Copy link
Contributor

@9prady9 9prady9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an image showing what goes where could be also a quick way to protray the hierarchy and how calls are made when a function is invoked.


Terminology:
* The term compute "function" refers to a particular general operation that may have many different implementations corresponding to different combinations of types or function behavior options.
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must ensure a kernel corresponding to the value types of the inputs is selected.

or

Suggested change
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must first select a suitable kernel corresponding to the value types of the inputs is selected.
* A specific implementation of a function is a "kernel". Selecting a viable kernel for executing a function is referred to as "dispatching". When executing a function on inputs, we must select a kernel corresponding to the value types of the inputs.

?

[Compute functions](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h) have the following principal attributes:
* A unique ["name"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4NK5arrow7compute8Function4nameEv) used for function invocation and language bindings
* A ["kind"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute8Function4KindE)
which indicates in what context it is valid for use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
which indicates in what context it is valid for use
indicates in what context it is valid for use

* Input/output [types](https://arrow.apache.org/docs/cpp/compute.html#type-categories) and [shapes](https://arrow.apache.org/docs/cpp/compute.html#input-shapes)
* Compute functions can also be further "categorized" based on the type of operation performed. For example, `Scalar Arithmetic` vs `Scalar String`.
* Compute functions (see [FunctionImpl and subclasses](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/function.h)) contain ["kernels"](https://github.com/edponce/arrow/tree/master/cpp/src/arrow/compute/kernels) which are implementations for specific argument signatures.
* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) which states the number of required arguments
* An ["arity"](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute5ArityE) states the number of required arguments

Files and structures of the computer layer
==========================================

This section describes the general structure of files/directory and principal code structures of the compute layer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This section describes the general structure of files/directory and principal code structures of the compute layer.
This section describes the general structure of files/directory and principal code structures of the compute layer using scalar function <pick something>.

followed by bullet points on how this function exists in code base. The raw information is present even now in the current list of points, but it feels like a flow of information is missing.

@pitrou
Copy link
Member

pitrou commented May 4, 2022

@edponce Are you planning to work on this in the future?

@edponce
Copy link
Contributor Author

edponce commented May 10, 2022

@pitrou Wow, I do not know why I extended this PR so much. I will take tomorrow morning to resolve all the issues and present a full draft.

@edponce
Copy link
Contributor Author

edponce commented May 10, 2022

Linking as reference this general outline for adding compute functions.

@edponce
Copy link
Contributor Author

edponce commented Jun 28, 2022

@drin
Copy link
Contributor

drin commented Aug 20, 2022

I plan on subsuming this PR into a new one (#13933). I will try to make sure all of these reviews are accommodated.

@drin
Copy link
Contributor

drin commented Aug 24, 2022

I accommodated most of these review comments in this commit: f0e3adf

@edponce I'm not sure what we should do with this PR, but I think it can be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants