-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Surprising memoized behavior for mutable containers #341
Comments
The case where completely different objects are assumed to be the same may be improved by looking at least at the number of bytes that can be expected to differ:
|
Yes, this a know issue (read: "expected behaviour"). The reason the first 32 bytes are taken as part of The 32 bytes value was chosen arbitrarily as a "reasonable" trade-off between map key size and likelihood of collisions. Again, the goal was detecting ID reuse, not changes to mutable objects.
Yes, however that will make
This is how it was previously implemented (though you would again have to truncate at some point to avoid potentially keeping a large number of large strings perpetually in memory). In practice,
As you say, that would make There are only three possible actual solutions to this problem
None of these are workable for reasons that are either mentioned above, or are, hopefully, self-evident. Which means we're left with some non-perfect workaround. Current implementation of We can, however, at least update the documentation for |
As proposed in the 2nd comment, 32 bytes does not seem reasonable since they can be identical on objects that have nothing in common, as shown in my second example. This is probably largely due to common fields at the beginning of most objects (at least 16 bytes for PyObject AFAICS in [1]). A better heuristic could look like my previous comment using
[1] https://github.com/python/cpython/blob/master/Include/object.h#L109 Out of all the solutions, maintaining a reference on objects and indexing by ID would always yield a functionally correct result. Most use cases of
One could argue that although using the ID allows to use it with for example lists, it does not mean it will actually be useful. A call like
To summarize, the following behavior would be functionally correct while allowing flexible calls:
An even more accurate key can be built using signature handling of
(bonus) Getting the "generalized ID" can be somehow achieved by calling recursively |
EDIT: |
This what I meant when I said code gets "awkward". Our code actually has a lot of instances where we have lists that aren't mutated, or aren't mutated after a certain point (pretty much all ordered sequences in in WA configuration processing are lists, and once config gets finalized they typically don't get touched). So the "edge" case where a memoized function gets called with the same non-hashable object which could, but actually hasn't changed, is quite common. There is a number of reasons why lists are used instead of tuples, and simply converting everything to tuples isn't really an option. Doing the conversion just before calling a memoized function is in theory possible, but will be difficult to track (in this case, raising a la
ooh, I do like that idea. It's worth playing around with. |
The following piece of code will only print once, although
foo
is called with arguments that would not compare equal:That will also only print once, although
C
andD
are wildly different classes:The root cause is
__get_memo_id
that only checks the first 32 bytes of the underlying representation of aPyObject
, in addition to theid()
of the object. IDs can be reallocated, and the same object will not always compare equal to a past state of itself (generally true for any immutable object). Similar issues will arise with classes implementing__eq__
without implementing__hash__
, or with naive but nonetheless valid implementations like:This is allowed by:
Generally speaking, that issue cannot generally be solved, apart from only using
memoized
with hashable types (which rules out lists). A potential workaround would be to use therepr()
of the object, hoping that it will reflect the full state of the object.Another workaround can be to keep a reference to the parameters, so they are never destructed and we ensure that if objects are not the same, they will get a different ID. This would make
memoized
relatively useless for types likes lists, where calls likefoo([1,2])
would not be memoized anymore.Note: Python 3
functools.lru_cache
avoids that issue by not supporting non-hashable types.https://docs.python.org/3/library/functools.html#functools.lru_cache
The text was updated successfully, but these errors were encountered: