Fixing issues related to dtype=object arrays in interpolation routines #1649
Conversation
…e stored as arrays. Without this, NumPy implicitly generates arrays but makes them ragged (dtype=object), which is bad for performance and disabled in newer versions.
In older NumPy, this would silently create an array-of-objects, but that has the wrong shape too.
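For illustration, a minimal sketch (not part of the PR, array contents made up) of the NumPy behaviour change described above:

import numpy as np

# Two entries of different shape: old NumPy silently made a dtype=object
# array of arrays (later with a VisibleDeprecationWarning); NumPy >= 1.24
# raises ValueError instead, unless dtype=object is requested explicitly.
ragged = [np.zeros((3, 1)), np.zeros((1, 4))]
try:
    arr = np.array(ragged)                      # fails on modern NumPy
except ValueError:
    arr = np.empty(len(ragged), dtype=object)   # explicit object array still works
    for i, entry in enumerate(ragged):
        arr[i] = entry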
…ing. Giving a list of inhomogeneous arrays as the result of a function may sometimes be convenient, but the automatic conversion is both prone to hiding bugs and always bad for performance, so it is better to require that the result actually has the correct shape, or at least one that NumPy can directly broadcast to the correct one.
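As a hedged sketch of that requirement (out_shape and result are placeholder names, not identifiers from the PR), a result is acceptable if its shape either matches the expected one or can be broadcast to it:

import numpy as np

out_shape = (2, 3)            # expected result shape (assumed for illustration)
result = np.zeros((1, 3))     # correct dtype, but needs broadcasting

try:
    result = np.broadcast_to(result, out_shape)
except ValueError:
    raise ValueError('result of shape {} is not broadcastable to {}'
                     .format(np.shape(result), out_shape)) from None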
…ject. This would previously happen because meshgrids are passed in as ragged arrays, and NumPy does not convert the rows to float dtype as it should.
Checking updated PR...
Comment last updated at 2024-08-29 16:09:09 UTC
Note about the linter comments: I personally disagree with many of pep8speaks' suggestions; in particular, E126 is also considered over-pedantic by others and does not actually follow from the PEP 8 style guide.
…riate. One of the tests samples from an integral grid at non-integral points. This failed after the explicit conversion introduced in c90044a. Falling back to `float` fixes the test case, though perhaps it would be better to select a dedicated meshing dtype.
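A minimal sketch of that fallback, with a made-up integer grid dtype and sample points (the actual code operates on the array's values dtype, compare the diff further down):

import numpy as np

grid_dtype = np.dtype('int64')           # dtype of an integer-valued grid
points = np.array([0.5, 1.25, 2.0])      # non-integral sample points

try:
    xi = points.astype(grid_dtype, casting='safe')   # not a safe cast here
except TypeError:
    xi = points.astype(float)                        # fall back to float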
…er of arguments. The ODL tests checked against the previously raised `ValueError`, but in newer versions `TypeError` is raised instead, which these tests could not handle.
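For instance, a test that should pass on both old and new NumPy could accept either exception; a small sketch (assuming pytest, as used by the ODL test suite):

import numpy as np
import pytest

def test_ufunc_wrong_argument_count():
    # Older NumPy raised ValueError here, newer versions raise TypeError.
    with pytest.raises((ValueError, TypeError)):
        np.add(np.ones(3))   # np.add requires two input arguments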
…cation. The partition builders are quite liberal (arguably, unnecessarily so) in the format of the boolean lists specifying along which dimensions the boundary should, or should not, be included in the grid. Previously, these free-form lists were passed through NumPy as a first step of analysing their form, but NumPy itself is now stricter about what can go into an array (unless explicitly asked with `dtype=object`, in which case a plain list could be used just as well). These checks can also be done more directly, presuming that the caller actually follows the specification.
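A hedged sketch of what such a direct check could look like; the helper name _normalize_nodes_on_bdry and the exact rules are illustrative, not the PR's actual implementation:

def _normalize_nodes_on_bdry(nodes_on_bdry, length):
    """Normalize a boundary spec to a list of (bool, bool) pairs."""
    if isinstance(nodes_on_bdry, bool):
        return [(nodes_on_bdry, nodes_on_bdry)] * length

    out_list = []
    for entry in nodes_on_bdry:
        if isinstance(entry, bool):
            out_list.append((entry, entry))     # shorthand: same on both sides
        else:
            left, right = entry                 # explicit (left, right) pair
            out_list.append((bool(left), bool(right)))

    if len(out_list) != length:
        raise ValueError('boundary specification has length {}, expected {}'
                         .format(len(out_list), length))
    return out_list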
… eval for points_collocation. Not doing this led to some of the familiar implicit dtype=object fallbacks that are illegal in modern NumPy.
Excellent Work Done! ready for release :)
@@ -621,6 +621,11 @@ def _find_indices(self, x):

        # iterate through dimensions
        for xi, cvec in zip(x, self.coord_vecs):
            try:
                xi = np.asarray(xi).astype(self.values.dtype, casting='safe')
            except TypeError:
Can you please add a warning so the user knows the values are cast as floats?
Done in 53c7760.
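For reference, a hedged sketch of such a warning; the actual message added in 53c7760 may be phrased differently:

import warnings

warnings.warn('Sample points were cast to float because a safe cast to the '
              'array dtype was not possible.', UserWarning)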
@@ -1332,12 +1339,12 @@ def dual_use_func(x, out=None, **kwargs):
        elif tensor_valued:
            # The out object can be any array-like of objects with shapes
            # that should all be broadcastable to scalar_out_shape.
            results = np.array(out)
This test relates to the function func_tens_oop, which checks that ODL broadcasts a list of float values into a NumPy array of the appropriate form. For now, what about adding a deprecation warning to this test?
@@ -1111,8 +1111,8 @@ def uniform_grid_fromintv(intv_prod, shape, nodes_on_bdry=True):

    shape = normalized_scalar_param_list(shape, intv_prod.ndim, safe_int_conv)

    if np.shape(nodes_on_bdry) == ():
        nodes_on_bdry = ([(bool(nodes_on_bdry), bool(nodes_on_bdry))] *
There should be a type hint for the nodes_on_bdry argument.
I agree, but am unsure how best to do it – as I often am in Python, where there is no unambiguous "correct type".
A direct version would be
def uniform_grid(min_pt, max_pt, shape,
                 nodes_on_bdry: Union[bool, List[Union[bool, Tuple[bool, bool]]]] = True):
    ...
That expresses fairly well the different ways the function can be called, but I reckon it looks daunting to the user. An alternative could be to define this union type separately:
BoundarySpecification = Union[bool, List[Union[bool, Tuple[bool, bool]]]]

def uniform_grid(min_pt, max_pt, shape, nodes_on_bdry: BoundarySpecification = True):
    ...
This has the advantage that the type signature remains short, though it may also be less informative. And arguably, if we introduce a named definition at all, the logical next step would be to make it a dedicated class, though that seems like overkill here.
It is also debatable whether List is the right thing. The documentation says "sequence". In a way tuples would be more natural (both because they're also used for array shapes, and because the type of each element can be different, whereas lists are more commonly understood as homogeneous containers), but I find
def uniform_grid(min_pt, max_pt, shape,
                 nodes_on_bdry: Union[bool, Tuple[Union[bool, Tuple[bool, bool]], ...]] = True):
    ...
more confusing than the version with List. I'm also not a fan of Iterable[...], which seems to suggest passing a lazily generated sequence.
To be fair, the Sphinx documentation is actually pretty clear about what nodes_on_bdry does and how it can be specified. Maybe a type hint causes more confusion than it clears up, and I should rather add some comments explaining what happens in the if isinstance(nodes_on_bdry, bool): branch and the following lines?
Here is how I understand the nodes_on_bdry argument: nodes_on_bdry : Union[bool, Tuple[bool]]. If a bool is provided, it sets all boundaries. If a tuple is provided, it functions as follows:
- check that the dimension of the tuple matches the dimension of the array: assert len(nodes_on_bdry) == array.dim
- we then have an array describing:
  - nodes_on_bdry = (node_on_bdry_left, node_on_bdry_right) in dimension 1
  - nodes_on_bdry = (node_on_bdry_left, node_on_bdry_right, node_on_bdry_top, node_on_bdry_bottom) in dimension 2
  - nodes_on_bdry = (node_on_bdry_left, node_on_bdry_right, node_on_bdry_top, node_on_bdry_bottom, node_on_bdry_front, node_on_bdry_back) in dimension 3

What do you think about that? :)
@Emvlt that's not how it currently works. Your examples should actually be written
- nodes_on_bdry = ((node_on_bdry_left, node_on_bdry_right),) in dimension 1
- nodes_on_bdry = ((node_on_bdry_left, node_on_bdry_right), (node_on_bdry_top, node_on_bdry_bottom)) in dimension 2
- nodes_on_bdry = ((node_on_bdry_left, node_on_bdry_right), (node_on_bdry_top, node_on_bdry_bottom), (node_on_bdry_front, node_on_bdry_back)) in dimension 3

So for example nodes_on_bdry = ((True, False), (True, True)) for a 2D domain includes the boundary everywhere except on the right. If that were all, the type hint could be written as Tuple[Tuple[bool, bool], ...] or, alternatively and IMO clearer, List[Tuple[bool, bool]].
However, the specification can also be shortened to ((True, False), True). Or it could be (True, (False, True)), which means something different: all boundaries included except the top one. This makes it all more concise but leads to much messier types, with a union nested inside a list.
I'm not sure anybody really needs this flexibility just to make their user code a few characters shorter. We could use a simplified type hint and then just remark in the docs that the specification can also be shortened by collapsing (False, False) to False.
I don't think we will come to a good decision here and now, so let's leave the interface as it is. But I opened an issue to discuss the use of type hints in the future.
When this happens it is likely that the user did something like starting from an integer mesh, but in that case linear interpolation does not seem very appropriate.
    out_list = [(bool(nodes_on_bdry[0]), bool(nodes_on_bdry[1]))]

    if isinstance(nodes_on_bdry, bool):
        return [(bool(nodes_on_bdry), bool(nodes_on_bdry))] * length
    elif (length == 1 and len(nodes_on_bdry) == 2
bool(nodes_on_bdry) can be replaced by nodes_on_bdry
odl/discr/discr_utils.py
Outdated
            # Some results don't have correct shape, need to
            # broadcast
            bcast_res = []
-           for res in results.ravel():
+           for res in out:
This loops only through the outer dimension. I think we should either remove the broadcasting altogether or make it general.
…ns case. Previously, this was more or less automatically done by NumPy, but not anymore. Normally, such a mismatch is more likely to be a user mistake. However, ODL actually relies somewhat on such broadcasts when defining functions on "sparse" mesh grids, so I added this functionality back by recursively traversing lists of different-shape arrays (as old NumPy versions did when generating ragged dtype=object arrays).
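A sketch of that recursive idea; the function name and scalar_out_shape are placeholders, not the PR's actual code:

import numpy as np

def _broadcast_nested(res, scalar_out_shape):
    """Recursively broadcast array-like leaves of nested lists/tuples."""
    if isinstance(res, (list, tuple)):
        return [_broadcast_nested(r, scalar_out_shape) for r in res]
    return np.broadcast_to(np.asarray(res), scalar_out_shape)

# e.g. _broadcast_nested([np.zeros((1, 4)), [1.0, np.ones((3, 4))]], (3, 4))
# broadcasts every leaf to shape (3, 4).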
There is a check that the result must have the correct dtype, even if it does not need to have the correct shape.
Pointed out by Jevgenija in odlgroup#1649 (review).
…ogeneously. This might give slightly better performance in the expected homogeneous case.
…dcasting utility.
…cation. The partition builders are quite liberal (arguably, unnecessarily so) in the format of the boolean lists specifying along which dimensions the boundary should, or should not, be included in the grid. Previously, these free-form lists were passed through NumPy as a first step of analysing their form, but NumPy itself is now stricter about what can go into an array (unless explicitly asked with `dtype=object`, in which case a plain list could be used just as well). These checks can also be done more directly, presuming that the caller actually follows the specification. Amended due to feedback by Jevgenija in odlgroup#1649 (review).
* NumPy changed which exception is raised for a ufunc call with the wrong number of arguments. The ODL tests checked against the previously raised `ValueError`, but in newer versions `TypeError` is raised instead, which these tests could not handle.
* Don't rely on obsolete NumPy inhomogeneous arrays for boundary specification. The partition builders are quite liberal (arguably, unnecessarily so) in the format of the boolean lists specifying along which dimensions the boundary should, or should not, be included in the grid. Previously, these free-form lists were passed through NumPy as a first step of analysing their form, but NumPy itself is now stricter about what can go into an array (unless explicitly asked with `dtype=object`, in which case a plain list could be used just as well). These checks can also be done more directly, presuming that the caller actually follows the specification. Amended due to feedback by Jevgenija in #1649 (review).
* Manual broadcasting for the general tensor-with-inconsistent-dimensions case. Previously, this was more or less automatically done by NumPy, but not anymore. Normally, such a mismatch is more likely to be a user mistake. However, ODL actually relies somewhat on such broadcasts when defining functions on "sparse" mesh grids, so I added this functionality back by recursively traversing lists of different-shape arrays (as old NumPy versions did when generating ragged dtype=object arrays).
* Change doc test to have a floating-point zero. There is a check that the result must have the correct dtype, even if it does not need to have the correct shape.
In old versions of NumPy, ODL relied on its ability to automatically represent ragged arrays as arrays of arrays (i.e., of objects). This was in particular used for meshgrids, which are a kind of discretization supported by the interpolation classes in odl.discr.

Current NumPy does not automatically convert to dtype=object anymore, and for good reasons: it is error-prone (shapes become ambiguous: should one consider the nested arrays or just the outer structure?) and performance / memory locality suffer. In #1633, this was addressed by explicitly generating an object-array specifically for the meshgrid-specifying inputs, but further testing (#1648) revealed that this was not sufficient: the dtype=object property would percolate into the interpolation calculations and cause new failures there due to required implicit conversions (as well as performance degradation).

This PR goes into the details of the interpolation routines and ensures that linear arrays are stored with a primitive dtype. It fixes the discretization tests on NumPy 1.19, though there are still some implicit conversions that the even stricter NumPy 1.26 does not accept, as well as some other tests that currently fail for unrelated reasons.
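For context, a small self-contained example (not code from the PR) of the kind of meshgrid involved: a "sparse" meshgrid is a tuple of float arrays with different shapes, which must not be fused into a single ragged dtype=object array.

import numpy as np

x = np.linspace(0, 1, 5)
y = np.linspace(0, 1, 7)

# Sparse meshgrid: arrays of shape (5, 1) and (1, 7) instead of two (5, 7) arrays.
mesh = np.meshgrid(x, y, indexing='ij', sparse=True)

# Functions on the grid rely on ordinary broadcasting of the two float arrays;
# no dtype=object wrapper is needed or wanted.
values = mesh[0] ** 2 + mesh[1]    # shape (5, 7)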