-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Current __hash__
implementation of OOI is broken
#4000
Comments
This ticket is related to #3808 and has to be solved either in that branched or before that branch is merged. |
Also we'll have to consider whether we want for:
to have different hashes (ie. |
My gut tells me we don't want them to be equal. Lets say we have a nibble that creates a finding when two URLs are similar, we would want the finding to exist on both urls. |
Started a RFD for this issue here: #4004 |
Would a subset of ooi fields be considered unique/similar enough to hash? I could imagine that an object still needs to be considered identical when for instance a field contains a timestamp (don't know if this is a likely scenario though). |
@underdarknl wrote in #4004
This makes sense from an XTDB perspective. From an Octopoes perspective, however, objects with the same primary key could differ there as attributes are mutated by Octopoes.
It is only a performance impact as long as we assure that the objects are unequal. The builtin python structures are able to deal with collisions at a performance cost. It is, however, ill practice to have very similar objects have the same hash, as we cannot guarantee that every implementation we use will handle such collisions properly.
In particular for nibbles the problem is twofold:
I will see if I can make such a test, in principle the hashing as done now is stable see: nl-kat-coordination/octopoes/tests/integration/test_nibbles.py Lines 317 to 346 in 1b6124f
but it has been justly questioned whether this is true for all cases on all platforms and such see: https://github.com/minvws/nl-kat-coordination/pull/3808/files/c4c561d469d88cb9f75cba403ebded744925e81f#r1893873228 Hope that satisfies at least parts of you requests. |
It depends on the usecase. For Uniqueness in the graph, the primary-key fields are enough. For detecting field-value changes when dealing with nibbles, the minimal set of fields that we could hash could be based on the full set of fields possibly accessed and or processes in that specific nibble. |
As mentioned before, I think it is important to separate XTDB and Octopoes cases. For the XTDB case primary keys are sufficient. Or to put it other words one puts two OOI's with the same primary-key/hash in a set/dict, what would be the desired behavior: everything else has to be implemented accordingly. One can use this code to fool around with the possibilities, by changing the
Note that there are possibly other routines that use |
Did everyone read the definition of hash, The only purpose of the hash is to provide an integer that can be used by hash tables. This is necessary for hash table based data structures such as sets and dictionaries. Defining The first requirement of The other requirement is that the hash value of an object must never change. This is logical given that it is used to determine the hash table bucket of the object, if the hash would change it means an object would be in the wrong bucket of the hash table. This is also the reason a mutable dictionary doesn't have a hash and can't be used in sets or as dictionary keys, because there is no way to statify both of these requirements. So is our current
I think in most places we create new python objects so that this doesn't result in problems. It would have been better if those attributes would have been immutable to prevent mistakes, but I don't think it is worth spending time on changing that given that we want to move to XTDB 2. Note that Django ORM does something similar as we do: it defines |
@dekkers this isn't an XTDB1/XTDB2 issue. The point is that Octopoes can generate OOI's with the same PK but with other attributes. While we have a way of dealing with that in XTDB (by "adding the fields") in the Octopoes/Python domain these are distinct objects that now have the same hash -- this is wrong, and that is what this issue is about -- and how to deal with it. |
@originalsouth After looking into this more I concluded the following. In terms of cryptographic hashing, as @dekkers notes Python's builtin In terms of equality checks and the usage of sets of OOIs, if we wouldn't have the equality check boiled in we would not consider them equal (see below). So as @dekkers's example shows, if we'd have e.g. two instances of "the same" IPPort with the same primary key where one has
So in my eyes this definition of And that brings me to what I think should perhaps be discussed here instead of the >>> {"a": 2, "b": 1} == {"b": 1, "a": 2}
True we shouldn't use |
The current
__hash__
implementation of OOI's:nl-kat-coordination/octopoes/octopoes/models/__init__.py
Lines 242 to 243 in 8730e18
is broken because it only considers the primary key; meaning that OOI's with fields not recorded in the primary key are erroneously deemed to be the same objects, causing Python's built-in hash dependent structures to find collapses.
Since we are dealing with OOI based on Pydantic BaseModel's we can easily generate a dict of the object using
model_dump
. Assuming that this is the best object to start to begin our__hash__
implementation on, the question becomes how to best hash a dict (as python still hasn't figured out how to do this natively).The natural question arises why not hash
model_dump_json
? Because there is no guarantee it is stable see pydantic/pydantic#10343.Hence, here I compare two algorithms with benchmarks:
Resulting in:
Personally, I would opt for
hasher_1
as it more flexible and faster, buthasher_2
is easier to maintain; also open to other suggestions.So how do we proceed to solve this problem?
The text was updated successfully, but these errors were encountered: