Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing issues related to dtype=object arrays in interpolation routines #1649

Conversation

leftaroundabout
Copy link
Contributor

In old versions of NumPy, ODL relied on its capability to represent ragged arrays automatically as arrays of arrays (i.e., of objects). This was in particular used for meshgrids, which are a kind of discretization supported by the interpolation classes in odl.discr.

Current NumPy does not automatically convert to dtype=object anymore, and for good reasons: it is error-prone (shapes become ambiguous, whether to consider the nested array or just its outer structure) and performance / memory locality suffers. In #1633, this was addressed by explicitly generating an object-array specifically for the meshgrid-specifying inputs, but further testing (#1648) revealed that this was not sufficient: the dtype=object property would percolate into the interpolation calculations, and there cause new failures due to required implicit conversion (as well as performance degradation).

This PR goes into the details of the interpolation routines and ensures linear arrays are stored with primitive dtype. It fixes the discretization tests in NumPy-1.19, though there are still some implicit conversions that the even stricter numpy-1.26 does not accept, as well as different tests that currently fail for unrelated reasons.

…e stored as arrays.

Without this, NumPy implicitly generates arrays but makes them ragged (dtype=object),
which is bad for performance and disabled in newer versions.
In older NumPy, this would silently create an array-of-object, but that
has a wrong shape too.
…ing.

Giving an inhomogeneous-arrays list as the result to a function may sometimes
be convenient, but the automatic conversion is both prone to hiding bugs and
always bad for performance, so better is to require that the result actually
has the correct shape or at least one that NumPy can directly broadcast to the
correct one.
…ject.

This would previously happen because meshgrids are passed in as ragged arrays,
and NumPy does not convert the rows to float dtype as should happen.
@pep8speaks
Copy link

pep8speaks commented Aug 8, 2024

Checking updated PR...

Line 1366:80: E501 line too long (98 > 79 characters)
Line 1029:80: E501 line too long (89 > 79 characters)
Line 951:80: E501 line too long (86 > 79 characters)
Line 936:1: E302 expected 2 blank lines, found 1
Line 629:20: E225 missing whitespace around operator
Line 629:19: E128 continuation line under-indented for visual indent

Comment last updated at 2024-08-29 16:09:09 UTC

@leftaroundabout
Copy link
Contributor Author

Note about the linter comments: I personally disagree with many of pep8speaks' suggestions, but in particular E126 is also by others considered over-pendantic and does not actually follow from the PEP8 style guide.

…riate.

One of the tests samples from an integral grid at non-integral points.
This failed after the explicit conversion introduced in c90044a.
Falling back to `float` fixes the test case, though perhaps it would
be better to select a dedicated meshing dtype.
…er of arguments.

The ODL tests checked against the previously raised `ValueError`, but in newer
versions `TypeError` is raised instead, which these tests could not handle.
…cation.

The partition builders are quite liberal (arguably, unnecessarily) in the format of the
boolean lists specifying along which dimensions the boundary should be included in
the grid and along which not. Previously, these free-form lists were passed through
NumPy as a first step of analysing their form, but NumPy itself is now more strict
in what can be in an array (unless explicitly asked with `dtype=object`, in which
case however a plain list could be used just as well).
These checks can also be done in a more direct fashion, presuming that the caller
actually follows the specification.
… eval for points_collocation.

Not doing this lead to some of the familiar implicit dtype=object fallbacks that
are illegal in modern NumPy.
@JevgenijaAksjonova JevgenijaAksjonova self-requested a review August 18, 2024 13:37
Copy link

@Emvlt Emvlt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent Work Done! ready for release :)

@@ -621,6 +621,11 @@ def _find_indices(self, x):

# iterate through dimensions
for xi, cvec in zip(x, self.coord_vecs):
try:
xi = np.asarray(xi).astype(self.values.dtype, casting='safe')
except TypeError:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a warning so the user knows the values are cast as floats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 53c7760.

@@ -1332,12 +1339,12 @@ def dual_use_func(x, out=None, **kwargs):
elif tensor_valued:
# The out object can be any array-like of objects with shapes
# that should all be broadcastable to scalar_out_shape.
results = np.array(out)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test relates to the function func_tens_oop, which checks that ODL broadcasts a list of float values to declare a numpy array to the appropriate numpy lingo. For now, what about adding a deprecation warning to this test?

@@ -1111,8 +1111,8 @@ def uniform_grid_fromintv(intv_prod, shape, nodes_on_bdry=True):

shape = normalized_scalar_param_list(shape, intv_prod.ndim, safe_int_conv)

if np.shape(nodes_on_bdry) == ():
nodes_on_bdry = ([(bool(nodes_on_bdry), bool(nodes_on_bdry))] *
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a type hinting for the nodes_on_bdry argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but am unsure how best to do it – as I often am in Python, where there is no unambiguous "correct type".

A direct version would be

def uniform_grid(min_pt, max_pt, shape,
        nodes_on_bdry: Union[bool, List[Union[bool, Tuple[bool,bool]]]] =True):
    ...

That expresses fairly well the different ways the function can be called, but I reckon it looks daunting to the user. An alternative could be to define this union type separately:

BoundarySpecification = Union[bool, List[Union[bool, Tuple[bool,bool]]]]

def uniform_grid(min_pt, max_pt, shape, nodes_on_bdry: BoundarySpecification =True):
    ...

This has the advantage that the type signature remains short, though it may also be less useful. And arguably, if we make a definition at all then it would be consequent to make it a dedicated class, though that seems overkill here.

It is also debatable whether List is the right thing. The documentation says "sequence". I a way tuples would be more natural (both because they're also used for array shapes, and because the type of each element can be different and lists are more commonly understood as homogeneous containers), but I find

def uniform_grid(min_pt, max_pt, shape,
        nodes_on_bdry: Union[bool, Tuple[Union[bool, Tuple[bool,bool]], ...]] =True):
    ...

more confusing than the version with List. I'm also not a fan of Iterable[...], that seems to suggest passing a lazily-generated sequence.

To be fair, the Sphinx documentation is actually pretty clear about what the nodes_on_bdry does and how it can be specified. Maybe a type hint causes more confusion than it clears up, and I should rather add some comments explaining what happens in the if isinstance(nodes_on_bdry, bool): and following lines?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is how I understand the nodes_on_bdry argument: nodes_on_bdry : Union[bool, Tuple[bool]]. If a bool is provided, it sets all boundaries. If a tuple is provided it functions as follows:

  1. check that the dimension of the tuple matches the dimension of the array: assert len(nodes_on_bdry) == array.dim
  2. We then have an array describing:
  • nodes_on_bdry = (node_on_bdry_left,node_on_bdry_right) in dimension 1
  • nodes_on_bdry = (node_on_bdry_left,node_on_bdry_right, node_on_bdry_top, node_on_bdry_bottom) in dimension 2
  • nodes_on_bdry = (node_on_bdry_left,node_on_bdry_right, node_on_bdry_top, node_on_bdry_bottom, node_on_bdry_front, node_on_bdry_back) in dimension 3

What do you think about that? :)

Copy link
Contributor Author

@leftaroundabout leftaroundabout Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Emvlt that's not how it currently works. Your examples should actually be written

  • nodes_on_bdry = ((node_on_bdry_left,node_on_bdry_right),) in dimension 1
  • nodes_on_bdry = ((node_on_bdry_left,node_on_bdry_right), (node_on_bdry_top, node_on_bdry_bottom)) in dimension 2
  • nodes_on_bdry = ((node_on_bdry_left,node_on_bdry_right), (node_on_bdry_top, node_on_bdry_bottom), (node_on_bdry_front, node_on_bdry_back))

So for example nodes_on_bdry = ((True, False), (True, True)) for a 2D domain that includes the boundary everywhere except in the right boundary. If that were all, the type hint could be written as

Tuple[Tuple[Bool,Bool], ...]

or alternatively and IMO clearer

List[Tuple[Bool,Bool]]

However, the specification can also be shortened to ((True, False), True). Or it could be (True, (False, True)) which means something different: all boundaries included except the top one. This makes it all more concise but leads to much messier types with a union nested inside a list.

I'm not sure if anybody really need this flexibility just to make their user code a few characters shorter. We could use a simplified type hint and then just remark in the docs that it can also be shortened by collapsing (False, False) to False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will come to a good decision here and then, let's leave the interface as it is for now. But I opened an issue to discuss the use of type hints for the future.

When this happens it is likely that the used did something like starting
from an integer mesh, but in that case linear interpolation does not seem
very appropriate.
out_list = [(bool(nodes_on_bdry[0]), bool(nodes_on_bdry[1]))]
if isinstance(nodes_on_bdry, bool):
return [(bool(nodes_on_bdry), bool(nodes_on_bdry))] * length
elif (length == 1 and len(nodes_on_bdry) == 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool(nodes_on_bdry) can be replaced by nodes_on_bdry

# Some results don't have correct shape, need to
# broadcast
bcast_res = []
for res in results.ravel():
for res in out:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loops only through the outer dimension. I think we should either remove the broadcasting all together or make it general.

…ns case.

Previously, this was more or less automatically done in NumPy, but not anymore.
Normally, this is more likely to be user mistake. However, ODL actually
relies somewhat on such broadcasts when defining functions on "sparse"
mesh grids, so I added this functionality back be recursively
transversing lists of different-shape arrays (like in the old version
NumPy did, manually generating ragged dtype=object arrays).
… supporting."

This reverts commit 30626fe.

The test in question now succeeds after 77e17f3.
There is a check that the result must have the correct dtype, even
if it does not need to have the correct shape.
…ogeneously.

This might give slightly better performance in the expected homogeneous case.
leftaroundabout added a commit to leftaroundabout/odl that referenced this pull request Aug 29, 2024
…cation.

The partition builders are quite liberal (arguably, unnecessarily) in the format of the
boolean lists specifying along which dimensions the boundary should be included in
the grid and along which not. Previously, these free-form lists were passed through
NumPy as a first step of analysing their form, but NumPy itself is now more strict
in what can be in an array (unless explicitly asked with `dtype=object`, in which
case however a plain list could be used just as well).
These checks can also be done in a more direct fashion, presuming that the caller
actually follows the specification.

Amended due to feedback by Jevgenija in odlgroup#1649 (review).
@leftaroundabout
Copy link
Contributor Author

This PR has become messy through the corrections, and anyways covers multiple issues that are only coarsely related. I moved the changes to #1655 and #1656.

JevgenijaAksjonova pushed a commit that referenced this pull request Sep 4, 2024
* NumPy changed what exception is raised for ufunc-call with wrong number of arguments.

The ODL tests checked against the previously raised `ValueError`, but in newer
versions `TypeError` is raised instead, which these tests could not handle.

* Don't rely on obsolete NumPy inhomogenous arrays for boundary specification.

The partition builders are quite liberal (arguably, unnecessarily) in the format of the
boolean lists specifying along which dimensions the boundary should be included in
the grid and along which not. Previously, these free-form lists were passed through
NumPy as a first step of analysing their form, but NumPy itself is now more strict
in what can be in an array (unless explicitly asked with `dtype=object`, in which
case however a plain list could be used just as well).
These checks can also be done in a more direct fashion, presuming that the caller
actually follows the specification.

Amended due to feedback by Jevgenija in #1649 (review).

* Manual broadcasting for the general tensor-with-inconsistent-dimensions case.

Previously, this was more or less automatically done in NumPy, but not anymore.
Normally, this is more likely to be user mistake. However, ODL actually
relies somewhat on such broadcasts when defining functions on "sparse"
mesh grids, so I added this functionality back be recursively
transversing lists of different-shape arrays (like in the old version
NumPy did, manually generating ragged dtype=object arrays).

* Change doc test to have floating zero.

There is a check that the result must have the correct dtype, even
if it does not need to have the correct shape.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants