Fixing issues related to dtype=object arrays in interpolation routines #1649

leftaroundabout · 2024-08-08T15:18:52Z

In old versions of NumPy, ODL relied on its capability to represent ragged arrays automatically as arrays of arrays (i.e., of objects). This was in particular used for meshgrids, which are a kind of discretization supported by the interpolation classes in odl.discr.

Current NumPy does not automatically convert to dtype=object anymore, and for good reasons: it is error-prone (shapes become ambiguous, whether to consider the nested array or just its outer structure) and performance / memory locality suffers. In #1633, this was addressed by explicitly generating an object-array specifically for the meshgrid-specifying inputs, but further testing (#1648) revealed that this was not sufficient: the dtype=object property would percolate into the interpolation calculations, and there cause new failures due to required implicit conversion (as well as performance degradation).
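As a minimal illustration of the point above (a sketch, not code from ODL; the variable names are made up): the arrays of a sparse meshgrid have different shapes, so they only fit into a single array if `dtype=object` is requested explicitly. Filling the object array element by element avoids asking NumPy to guess a common shape.

```python
import numpy as np

# Sparse meshgrid: one coordinate array per axis, with different shapes.
mesh = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 4),
                   indexing='ij', sparse=True)
print([m.shape for m in mesh])  # shapes (3, 1) and (1, 4)

# Explicit construction of an object array, filled element by element so
# that NumPy never has to infer a common shape for the ragged input.
mesh_obj = np.empty(len(mesh), dtype=object)
for i, m in enumerate(mesh):
    mesh_obj[i] = m
```

The outer array then has shape `(2,)` and `dtype=object`, while each element keeps its own float dtype and shape.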

This PR goes into the details of the interpolation routines and ensures linear arrays are stored with primitive dtype. It fixes the discretization tests with NumPy 1.19, though there are still some implicit conversions that the even stricter NumPy 1.26 does not accept, as well as other tests that currently fail for unrelated reasons.

…e stored as arrays. Without this, NumPy implicitly generates arrays but makes them ragged (dtype=object), which is bad for performance and disabled in newer versions.

In older NumPy, this would silently create an array-of-object, but that has a wrong shape too.

…ing. Returning a list of inhomogeneous arrays as the result of a function may sometimes be convenient, but the automatic conversion is both prone to hiding bugs and always bad for performance, so it is better to require that the result actually has the correct shape, or at least one that NumPy can directly broadcast to the correct one.

…ject. This would previously happen because meshgrids are passed in as ragged arrays, and NumPy does not convert the rows to float dtype as should happen.
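A small sketch of why this matters (illustrative only, not taken from the ODL code): once a coordinate row has `dtype=object`, plain arithmetic keeps that dtype, so it spreads into everything computed from it. Converting once, up front, restores a primitive dtype.

```python
import numpy as np

# A coordinate row that ended up with dtype=object, e.g. because it was
# taken out of a ragged array of arrays.
row_obj = np.array([0.0, 0.5, 1.0], dtype=object)

# Arithmetic keeps the object dtype, so it percolates into later results.
shifted_obj = row_obj - 0.25

# Converting explicitly before the calculation gives a primitive dtype.
row = np.asarray(row_obj, dtype=float)
shifted = row - 0.25
```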

pep8speaks · 2024-08-08T15:18:58Z

Checking updated PR...

In the file odl/discr/discr_utils.py:

Line 1366:80: E501 line too long (98 > 79 characters)
Line 1029:80: E501 line too long (89 > 79 characters)
Line 951:80: E501 line too long (86 > 79 characters)
Line 936:1: E302 expected 2 blank lines, found 1
Line 629:20: E225 missing whitespace around operator
Line 629:19: E128 continuation line under-indented for visual indent

Comment last updated at 2024-08-29 16:09:09 UTC

leftaroundabout · 2024-08-08T15:41:00Z

Note about the linter comments: I personally disagree with many of pep8speaks' suggestions, but E126 in particular is also considered over-pedantic by others and does not actually follow from the PEP8 style guide.

…riate. One of the tests samples from an integral grid at non-integral points. This failed after the explicit conversion introduced in c90044a. Falling back to `float` fixes the test case, though perhaps it would be better to select a dedicated meshing dtype.

…er of arguments. The ODL tests checked against the previously raised `ValueError`, but in newer versions `TypeError` is raised instead, which these tests could not handle.
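One way to make such a check independent of the NumPy version is to accept either exception type. The following is only a sketch with a made-up function name, using `np.add` as a stand-in for whichever ufunc the ODL tests call; in a pytest-based test the same idea can be written as `pytest.raises((ValueError, TypeError))`.

```python
import numpy as np


def raises_on_wrong_arity():
    """Return True if calling a binary ufunc with one argument fails."""
    try:
        np.add(1)  # np.add needs two input arguments
    except (ValueError, TypeError):
        # Which of the two is raised depends on the NumPy version.
        return True
    return False
```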

…cation. The partition builders are quite liberal (arguably, unnecessarily) in the format of the boolean lists specifying along which dimensions the boundary should be included in the grid and along which not. Previously, these free-form lists were passed through NumPy as a first step of analysing their form, but NumPy itself is now more strict in what can be in an array (unless explicitly asked with `dtype=object`, in which case however a plain list could be used just as well). These checks can also be done in a more direct fashion, presuming that the caller actually follows the specification.

… eval for points_collocation. Not doing this led to some of the familiar implicit dtype=object fallbacks that are illegal in modern NumPy.

Emvlt

Excellent Work Done! ready for release :)

Emvlt · 2024-08-27T11:43:04Z

odl/discr/discr_utils.py

@@ -621,6 +621,11 @@ def _find_indices(self, x):

        # iterate through dimensions
        for xi, cvec in zip(x, self.coord_vecs):
+            try:
+                xi = np.asarray(xi).astype(self.values.dtype, casting='safe')
+            except TypeError:


Can you please add a warning so the user knows the values are cast as floats?

Done in 53c7760.
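For readers following along, the pattern under discussion looks roughly like this. This is a standalone sketch, not the code from 53c7760: the helper name, the warning category and the message are assumptions made for the example.

```python
import warnings

import numpy as np


def cast_points(xi, values_dtype):
    """Hypothetical helper: cast ``xi`` to ``values_dtype`` if that is
    safe, otherwise warn and fall back to float."""
    xi = np.asarray(xi)
    try:
        return xi.astype(values_dtype, casting='safe')
    except TypeError:
        # E.g. float points on an integer-valued grid: a safe cast would
        # lose the fractional part, so NumPy refuses it.
        warnings.warn(
            'cannot safely cast points of dtype {} to {}, using float '
            'instead'.format(xi.dtype, np.dtype(values_dtype)),
            RuntimeWarning)
        return xi.astype(float)
```

With this, `cast_points([0.5, 1.5], np.int64)` warns and returns a float array, whereas integer points on a float-valued grid are cast silently.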

Emvlt · 2024-08-27T12:02:18Z

odl/discr/discr_utils.py

@@ -1332,12 +1339,12 @@ def dual_use_func(x, out=None, **kwargs):
            elif tensor_valued:
                # The out object can be any array-like of objects with shapes
                # that should all be broadcastable to scalar_out_shape.
-                results = np.array(out)


This test relates to the function func_tens_oop, which checks that ODL broadcasts a list of float values to declare a numpy array to the appropriate numpy lingo. For now, what about adding a deprecation warning to this test?

Emvlt · 2024-08-27T12:07:55Z

odl/discr/grid.py

@@ -1111,8 +1111,8 @@ def uniform_grid_fromintv(intv_prod, shape, nodes_on_bdry=True):

    shape = normalized_scalar_param_list(shape, intv_prod.ndim, safe_int_conv)

-    if np.shape(nodes_on_bdry) == ():
-        nodes_on_bdry = ([(bool(nodes_on_bdry), bool(nodes_on_bdry))] *


There should be a type hint for the nodes_on_bdry argument

I agree, but am unsure how best to do it – as I often am in Python, where there is no unambiguous "correct type".

A direct version would be

def uniform_grid(min_pt, max_pt, shape, nodes_on_bdry: Union[bool, List[Union[bool, Tuple[bool,bool]]]] =True): ...

That expresses fairly well the different ways the function can be called, but I reckon it looks daunting to the user. An alternative could be to define this union type separately:

BoundarySpecification = Union[bool, List[Union[bool, Tuple[bool,bool]]]]
def uniform_grid(min_pt, max_pt, shape, nodes_on_bdry: BoundarySpecification =True): ...

This has the advantage that the type signature remains short, though it may also be less useful. And arguably, if we make a definition at all then it would be consequent to make it a dedicated class, though that seems overkill here.

It is also debatable whether List is the right thing. The documentation says "sequence". In a way, tuples would be more natural (both because they're also used for array shapes, and because the type of each element can be different and lists are more commonly understood as homogeneous containers), but I find

def uniform_grid(min_pt, max_pt, shape, nodes_on_bdry: Union[bool, Tuple[Union[bool, Tuple[bool,bool]], ...]] =True): ...

more confusing than the version with List. I'm also not a fan of Iterable[...], that seems to suggest passing a lazily-generated sequence.

To be fair, the Sphinx documentation is actually pretty clear about what nodes_on_bdry does and how it can be specified. Maybe a type hint causes more confusion than it clears up, and I should rather add some comments explaining what happens in the if isinstance(nodes_on_bdry, bool): and following lines?

Here is how I understand the nodes_on_bdry argument: nodes_on_bdry : Union[bool, Tuple[bool]]. If a bool is provided, it sets all boundaries. If a tuple is provided it functions as follows:

check that the dimension of the tuple matches the dimension of the array: assert len(nodes_on_bdry) == array.dim

We then have an array describing:

nodes_on_bdry = (node_on_bdry_left,node_on_bdry_right) in dimension 1

nodes_on_bdry = (node_on_bdry_left,node_on_bdry_right, node_on_bdry_top, node_on_bdry_bottom) in dimension 2

nodes_on_bdry = (node_on_bdry_left,node_on_bdry_right, node_on_bdry_top, node_on_bdry_bottom, node_on_bdry_front, node_on_bdry_back) in dimension 3

What do you think about that? :)

@Emvlt that's not how it currently works. Your examples should actually be written

nodes_on_bdry = ((node_on_bdry_left,node_on_bdry_right),) in dimension 1

nodes_on_bdry = ((node_on_bdry_left,node_on_bdry_right), (node_on_bdry_top, node_on_bdry_bottom)) in dimension 2

nodes_on_bdry = ((node_on_bdry_left,node_on_bdry_right), (node_on_bdry_top, node_on_bdry_bottom), (node_on_bdry_front, node_on_bdry_back))

So for example nodes_on_bdry = ((True, False), (True, True)) for a 2D domain that includes the boundary everywhere except at the right boundary. If that were all, the type hint could be written as

Tuple[Tuple[bool,bool], ...]

or alternatively and IMO clearer

List[Tuple[bool,bool]]

However, the specification can also be shortened to ((True, False), True). Or it could be (True, (False, True)) which means something different: all boundaries included except the top one. This makes it all more concise but leads to much messier types with a union nested inside a list.

I'm not sure if anybody really needs this flexibility just to make their user code a few characters shorter. We could use a simplified type hint and then just remark in the docs that it can also be shortened by collapsing (False, False) to False.

I don't think we will come to a good decision here and now, so let's leave the interface as it is for now. But I opened an issue to discuss the use of type hints for the future.
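To make the accepted forms concrete, here is a rough sketch of how such a specification can be expanded without going through NumPy. It is not the ODL implementation: the alias and function names are hypothetical, `Sequence` is used instead of `List` so that tuples such as ((True, False), True) also match the hint, and the one-dimensional special case from the diff is left out.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical alias, loosely following the Union variant discussed above.
BoundarySpecification = Union[
    bool, Sequence[Union[bool, Tuple[bool, bool]]]]


def normalize_nodes_on_bdry(nodes_on_bdry: BoundarySpecification,
                            ndim: int) -> List[Tuple[bool, bool]]:
    """Expand a specification to one (left, right) pair per axis."""
    if isinstance(nodes_on_bdry, bool):
        # A single bool applies to both ends of every axis.
        return [(nodes_on_bdry, nodes_on_bdry)] * ndim
    if len(nodes_on_bdry) != ndim:
        raise ValueError('expected {} entries, got {}'
                         ''.format(ndim, len(nodes_on_bdry)))
    pairs = []
    for entry in nodes_on_bdry:
        if isinstance(entry, bool):
            # Shorthand: a bool b stands for the pair (b, b).
            pairs.append((entry, entry))
        else:
            left, right = entry
            pairs.append((bool(left), bool(right)))
    return pairs
```

For example, `normalize_nodes_on_bdry(((True, False), True), 2)` and `normalize_nodes_on_bdry((True, (False, True)), 2)` expand to different lists of pairs, which is the distinction described above.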

When this happens it is likely that the user did something like starting from an integer mesh, but in that case linear interpolation does not seem very appropriate.

JevgenijaAksjonova · 2024-08-29T13:39:46Z

odl/util/normalize.py

-        out_list = [(bool(nodes_on_bdry[0]), bool(nodes_on_bdry[1]))]
+    if isinstance(nodes_on_bdry, bool):
+        return [(bool(nodes_on_bdry), bool(nodes_on_bdry))] * length
+    elif (length == 1 and len(nodes_on_bdry) == 2


bool(nodes_on_bdry) can be replaced by nodes_on_bdry

JevgenijaAksjonova · 2024-08-29T13:43:38Z

odl/discr/discr_utils.py

                    # Some results don't have correct shape, need to
                    # broadcast
                    bcast_res = []
-                    for res in results.ravel():
+                    for res in out:


This loops only through the outer dimension. I think we should either remove the broadcasting altogether or make it general.

…ns case. Previously, this was more or less automatically done in NumPy, but not anymore. Normally, this is more likely to be a user mistake. However, ODL actually relies somewhat on such broadcasts when defining functions on "sparse" mesh grids, so I added this functionality back by recursively traversing lists of different-shape arrays (like the old version of NumPy did, manually generating ragged dtype=object arrays).
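The recursive idea can be sketched as follows. This is a simplified standalone version with a hypothetical helper name, not the utility added in this PR; it assumes that every leaf is a scalar or an array that broadcasts to the target shape and that sublists on the same level have equal length.

```python
import numpy as np


def broadcast_nested(out, target_shape):
    """Hypothetical helper: turn nested lists of scalars and arrays into
    one array of shape ``<nesting shape> + target_shape``."""
    if isinstance(out, (list, tuple)):
        return np.stack([broadcast_nested(o, target_shape) for o in out])
    # Leaf: a scalar or an array that must broadcast to target_shape.
    return np.broadcast_to(np.asarray(out), target_shape)


# Sparse 2D mesh with coordinate arrays of shape (3, 1) and (1, 4).
x0, x1 = np.meshgrid(np.arange(3.0), np.arange(4.0),
                     indexing='ij', sparse=True)
# A 2x2 tensor-valued result whose entries have four different shapes.
out = [[x0, 0.0], [x1, x0 + x1]]
result = broadcast_nested(out, (3, 4))
print(result.shape)  # (2, 2, 3, 4)
```

Unlike a loop over the outer list only, the recursion handles any nesting depth, which is the generality asked for in the review comment.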

… supporting." This reverts commit 30626fe. The test in question now succeeds after 77e17f3.

There is a check that the result must have the correct dtype, even if it does not need to have the correct shape.

Pointed out by Jevgenija in odlgroup#1649 (review).

…ogeneously. This might give slightly better performance in the expected homogeneous case.

…dcasting utility.

…cation. The partition builders are quite liberal (arguably, unnecessarily) in the format of the boolean lists specifying along which dimensions the boundary should be included in the grid and along which not. Previously, these free-form lists were passed through NumPy as a first step of analysing their form, but NumPy itself is now more strict in what can be in an array (unless explicitly asked with `dtype=object`, in which case however a plain list could be used just as well). These checks can also be done in a more direct fashion, presuming that the caller actually follows the specification. Amended due to feedback by Jevgenija in odlgroup#1649 (review).

leftaroundabout · 2024-08-29T17:04:04Z

This PR has become messy through the corrections, and anyways covers multiple issues that are only coarsely related. I moved the changes to #1655 and #1656.

* NumPy changed what exception is raised for ufunc-call with wrong number of arguments. The ODL tests checked against the previously raised `ValueError`, but in newer versions `TypeError` is raised instead, which these tests could not handle.
* Don't rely on obsolete NumPy inhomogeneous arrays for boundary specification. The partition builders are quite liberal (arguably, unnecessarily) in the format of the boolean lists specifying along which dimensions the boundary should be included in the grid and along which not. Previously, these free-form lists were passed through NumPy as a first step of analysing their form, but NumPy itself is now more strict in what can be in an array (unless explicitly asked with `dtype=object`, in which case however a plain list could be used just as well). These checks can also be done in a more direct fashion, presuming that the caller actually follows the specification. Amended due to feedback by Jevgenija in #1649 (review).
* Manual broadcasting for the general tensor-with-inconsistent-dimensions case. Previously, this was more or less automatically done in NumPy, but not anymore. Normally, this is more likely to be a user mistake. However, ODL actually relies somewhat on such broadcasts when defining functions on "sparse" mesh grids, so I added this functionality back by recursively traversing lists of different-shape arrays (like the old version of NumPy did, manually generating ragged dtype=object arrays).
* Change doc test to have floating zero. There is a check that the result must have the correct dtype, even if it does not need to have the correct shape.

leftaroundabout added 5 commits August 8, 2024 17:02

Ensure the distances to computed linear-interpolation weights from ar…

2b3edef

…e stored as arrays. Without this, NumPy implicitly generates arrays but makes them ragged (dtype=object), which is bad for performance and disabled in newer versions.

Failure to convert to array implies it is not a suitable input array.

80b1d8e

In older NumPy, this would silently create an array-of-object, but that has a wrong shape too.

Ensure interpolation weight is computed with the correct array type.

380ee09

Ensure mesh coordinate calculations are not carried out with dtype=ob…

c90044a

…ject. This would previously happen because meshgrids are passed in as ragged arrays, and NumPy does not convert the rows to float dtype as should happen.

Linter whitespace rules.

bd73b42

leftaroundabout added 5 commits August 9, 2024 16:15

NumPy changed what exception is raised for ufunc-call with wrong numb…

3e3ea87

…er of arguments. The ODL tests checked against the previously raised `ValueError`, but in newer versions `TypeError` is raised instead, which these tests could not handle.

Explicitly go through elements that may need to be broadcasted in fn.…

b6a912a

… eval for points_collocation. Not doing this led to some of the familiar implicit dtype=object fallbacks that are illegal in modern NumPy.

Coding style details criticised by pep8speaks.

2828057

JevgenijaAksjonova self-requested a review August 18, 2024 13:37

Emvlt reviewed Aug 27, 2024


Add warning when falling back to float for interpolation coefficients.

53c7760

When this happens it is likely that the user did something like starting from an integer mesh, but in that case linear interpolation does not seem very appropriate.

JevgenijaAksjonova reviewed Aug 29, 2024


leftaroundabout added 6 commits August 29, 2024 16:39

Revert "Disable a test that currently fails and is probably not worth…

e80c43d

… supporting." This reverts commit 30626fe. The test in question now succeeds after 77e17f3.

Change doc test to have floating zero.

3b38e7f

There is a check that the result must have the correct dtype, even if it does not need to have the correct shape.

Omit unnecessary type coercion.

7c5e31f

Pointed out by Jevgenija in odlgroup#1649 (review).

Only perform manual broadcasting in case NumPy fails to broadcast hom…

9b3f4f4

…ogeneously. This might give slightly better performance in the expected homogeneous case.

Explicitly pass domain dimension as an argument to inhomogeneous-broa…

de6bc52

…dcasting utility.

leftaroundabout mentioned this pull request Aug 29, 2024

Fixing issues related to dtype=object arrays in interpolation routines #1655

Merged

leftaroundabout mentioned this pull request Aug 29, 2024

dtype=object array problems in function-on-grid evaluation #1656

Merged

leftaroundabout closed this Aug 29, 2024
