Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Current __hash__ implementation of OOI is broken #4000

Open
originalsouth opened this issue Dec 31, 2024 · 12 comments
Open

Current __hash__ implementation of OOI is broken #4000

originalsouth opened this issue Dec 31, 2024 · 12 comments
Labels
bug Something isn't working discussion octopoes Issues related to octopoes

Comments

@originalsouth
Copy link
Contributor

The current __hash__ implementation of OOI's:

def __hash__(self):
return hash(self.primary_key)

is broken because it only considers the primary key; meaning that OOI's with fields not recorded in the primary key are erroneously deemed to be the same objects, causing Python's built-in hash dependent structures to find collapses.

Since we are dealing with OOI based on Pydantic BaseModel's we can easily generate a dict of the object using model_dump. Assuming that this is the best object to start to begin our __hash__ implementation on, the question becomes how to best hash a dict (as python still hasn't figured out how to do this natively).

The natural question arises why not hash model_dump_json? Because there is no guarantee it is stable see pydantic/pydantic#10343.

Hence, here I compare two algorithms with benchmarks:

#!/usr/bin/env python

from typing import Iterable, Any
import jcs
import random
import string
import time

N = 8
MAX_DEPTH = 4
K = 10**6


def random_string():
    return ''.join(random.choice(string.ascii_letters) for _ in range(random.randint(1, N)))


def random_obj(depth: int = 0, k: int = 7) -> Any:
    if depth < MAX_DEPTH:
        rand_choice = random.randint(0, k)
        if rand_choice == 0:
            return random.random()  # float
        elif rand_choice == 1:
            return random.randint(0, 10**6)  # int
        elif rand_choice == 2:
            return random_string()  # string
        elif rand_choice == 3:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
        elif rand_choice == 4:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2) for _ in range(random.randint(1, N))}  # set
        else:
            return {random_string(): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[str, Any]
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[Any, Any]
    else:
        return random_string()


targets = [random_obj() for _ in range(K)]


def hasher_1(obj: Any) -> int:
    def freeze(obj: Iterable[Any | Iterable[Any]]) -> Iterable[int]:
        if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
            if isinstance(obj, dict):
                for key, value in obj.items():
                    yield hash(key)
                    yield from freeze(value)
            else:
                for item in obj:
                    yield from freeze(item)
        else:
            yield hash(obj)
    return hash(tuple(freeze(obj)))


def hasher_2(obj: Any) -> int:
    return hash(jcs.canonicalize(obj))


st = time.time_ns()
_ = list(map(hasher_1, targets))
dt = time.time_ns() - st

print(f"hasher_1: {dt / 10**9 / K}s")


st = time.time_ns()
_ = list(map(hasher_2, targets))
dt = time.time_ns() - st

print(f"hasher_2: {dt / 10**9 / K}s")

Resulting in:

hasher_1: 2.213041571e-05s
hasher_2: 3.159127834e-05s

Personally, I would opt for hasher_1 as it more flexible and faster, but hasher_2 is easier to maintain; also open to other suggestions.

So how do we proceed to solve this problem?

@originalsouth originalsouth added bug Something isn't working discussion octopoes Issues related to octopoes labels Dec 31, 2024
@github-project-automation github-project-automation bot moved this to Incoming features / Need assessment in KAT Dec 31, 2024
@originalsouth originalsouth moved this from Incoming features / Need assessment to To be discussed in KAT Dec 31, 2024
@originalsouth
Copy link
Contributor Author

This ticket is related to #3808 and has to be solved either in that branched or before that branch is merged.

@originalsouth
Copy link
Contributor Author

originalsouth commented Jan 1, 2025

Also we'll have to consider whether we want for:

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

to have different hashes (ie. hash(d1) == hash(d2) or hash(d1) != hash(d2)).
(this because python dicts are ordered)

@noamblitz
Copy link
Contributor

Also we'll have to consider whether we want for:

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

to have different hashes (ie. hash(d1) == hash(d2) or hash(d1) != hash(d2)). (this because python dicts are ordered)

My gut tells me we don't want them to be equal. Lets say we have a nibble that creates a finding when two URLs are similar, we would want the finding to exist on both urls.

@originalsouth
Copy link
Contributor Author

originalsouth commented Jan 1, 2025

Right (although, don't understand your reasoning...) but this plays counter-intuitively with pydantic it seems... pydantic seems to work accordingly

@jpbruinsslot
Copy link
Contributor

Started a RFD for this issue here: #4004

@jpbruinsslot
Copy link
Contributor

Would a subset of ooi fields be considered unique/similar enough to hash? I could imagine that an object still needs to be considered identical when for instance a field contains a timestamp (don't know if this is a likely scenario though).

@originalsouth
Copy link
Contributor Author

@underdarknl wrote in #4004

I'd like a little more explanation on why we chose to override the default hashing and implement hashing only on the primary-key fields: Because we cannot have two different objects in the graph with the same primary key, it makes sense to also have the rest of the codebase 'think' they are the same by hashing only those fields that would lead to different objects in the graph.

This makes sense from an XTDB perspective. From an Octopoes perspective, however, objects with the same primary key could differ there as attributes are mutated by Octopoes.

I'd also like a little more info on why and where this is problematic for the python code itself if we disregard the notion that we can only keep one version of two objects in the database If their primary key is the same.

It is only a performance impact as long as we assure that the objects are unequal. The builtin python structures are able to deal with collisions at a performance cost. It is, however, ill practice to have very similar objects have the same hash, as we cannot guarantee that every implementation we use will handle such collisions properly.

Furthermore a bit more details on why this specifically is a problem in the context of dealing with Objects in nibbles where we want to 'see' changes on any field that is used/consumed/read/queried by the nibble, regardless of what the primary key's values are. For example, having TTL for a DNSRecord in the primary key is useless for our Graph, but you could still envision a nibble that alerts users of DNSRecords with TTL's that are either too long, or too short. Not being able to see TTL changes (because from the outside the hash of the ooi stayed the same) would mean we only get to run the Nibble once, which is obviously not what you'd want.

In particular for nibbles the problem is twofold:
1.Nibbles xxhash OOI's for recalculation, the underlying serialization (necessary for the resolution of this problem) must be reliable for hashing.
2. Nibbles are written with uniqueness of objects in mind and uses dicts and sets as such; they will likely suffer performance impact. Both because the data structures get slower and because nibbles will run/rerun unexpectedly.

n.b. A test nibble doing exactly this (min / max on TTL based on a config) would be a great way of demonstrating this, and adding functionality.

I will see if I can make such a test, in principle the hashing as done now is stable see:

def test_nibbles_update(xtdb_octopoes_service: OctopoesService, event_manager: Mock, valid_time: datetime):
xtdb_octopoes_service.nibbler.nibbles = {find_network_url_nibble.id: find_network_url_nibble}
network1 = Network(name="internetverbinding")
xtdb_octopoes_service.ooi_repository.save(network1, valid_time)
event_manager.complete_process_events(xtdb_octopoes_service)
network2 = Network(name="internet")
xtdb_octopoes_service.ooi_repository.save(network2, valid_time)
event_manager.complete_process_events(xtdb_octopoes_service)
url1 = URL(network=network1.reference, raw="https://potato.ls/")
xtdb_octopoes_service.ooi_repository.save(url1, valid_time)
event_manager.complete_process_events(xtdb_octopoes_service)
url2 = URL(network=network2.reference, raw="https://mispo.es/")
xtdb_octopoes_service.ooi_repository.save(url2, valid_time)
event_manager.complete_process_events(xtdb_octopoes_service)
assert xtdb_octopoes_service.ooi_repository.list_oois({Finding}, valid_time).count == 1
assert xtdb_octopoes_service.ooi_repository.list_oois({KATFindingType}, valid_time).count == 0
find_network_url_nibble_v2 = find_network_url_nibble.model_copy(deep=True)
find_network_url_nibble_v2._payload = getattr(sys.modules[__name__], "find_network_url_v2")
find_network_url_nibble_v2._checksum = "deadbeef"
xtdb_octopoes_service.nibbler.update_nibbles(
valid_time, {find_network_url_nibble_v2.id: find_network_url_nibble_v2}
)
event_manager.complete_process_events(xtdb_octopoes_service)
assert xtdb_octopoes_service.ooi_repository.list_oois({Finding}, valid_time).count == 1
assert xtdb_octopoes_service.ooi_repository.list_oois({KATFindingType}, valid_time).count == 1

but it has been justly questioned whether this is true for all cases on all platforms and such see:
https://github.com/minvws/nl-kat-coordination/pull/3808/files/c4c561d469d88cb9f75cba403ebded744925e81f#r1893873228

Hope that satisfies at least parts of you requests.

@underdarknl
Copy link
Contributor

Would a subset of ooi fields be considered unique/similar enough to hash? I could imagine that an object still needs to be considered identical when for instance a field contains a timestamp (don't know if this is a likely scenario though).

It depends on the usecase. For Uniqueness in the graph, the primary-key fields are enough. For detecting field-value changes when dealing with nibbles, the minimal set of fields that we could hash could be based on the full set of fields possibly accessed and or processes in that specific nibble.
This leads us to two (or more) separate hashing needs, unless we find a way to cover them all with a single hash, however we'd need to somehow do this without triggering unneeded false-positives in the situations where we where not interested in all those extra fields being in our hash.

@originalsouth
Copy link
Contributor Author

It depends on the usecase. For Uniqueness in the graph, the primary-key fields are enough. For detecting field-value changes when dealing with nibbles, the minimal set of fields that we could hash could be based on the full set of fields possibly accessed and or processes in that specific nibble.

As mentioned before, I think it is important to separate XTDB and Octopoes cases.

For the XTDB case primary keys are sufficient.
For the Octopoes case (i.e. Python side of the story), which mutates OOI's, it is not so a priori.

Or to put it other words one puts two OOI's with the same primary-key/hash in a set/dict, what would be the desired behavior: everything else has to be implemented accordingly.

One can use this code to fool around with the possibilities, by changing the __hash__/__eq__ functions accordingly:

#!/usr/bin/env python

import random
from typing import Any

from pydantic import BaseModel


class Potato(BaseModel):
    potato_properties: dict[str, Any]

    def __hash__(windows95) -> int:
        return random.randint(0, 0xFFFFFFFF)

    def __eq__(windows95, _) -> bool:
        return True


potato1 = Potato(potato_properties={"type": "piper"})
potato2 = Potato(potato_properties={"type": "king-edward"})
potato3 = Potato(potato_properties={"type": "kennebec"})
potato4 = Potato(potato_properties={"type": "piper"})

s = set([potato1, potato2, potato3, potato4])
d = {potato1: "fried", potato2: "baked", potato3: "boiled", potato4: "fried"}

print(s)
print(d)

Note that there are possibly other routines that use __hash__ under hood -- which we might use and should take note off like chain-maps, caches, counters, unions and the like.

@dekkers
Copy link
Contributor

dekkers commented Jan 3, 2025

Did everyone read the definition of hash, __hash__ and hashable? If not please do so, because this is crucial.

The only purpose of the hash is to provide an integer that can be used by hash tables. This is necessary for hash table based data structures such as sets and dictionaries. Defining __hash__ on a class that also defines __eq__ makes it possible to use instances of such a class in sets and as key in a dictionary.

The first requirement of __hash__ is that the return value must be equal if the object compares equal. That the hash of two objects are equal does not mean the objects compare equal, it is possible for two different objects to have the same hash value. This isn't a problem because hash tables need to deal with collisions anyway. If it happens too often it might result in performance problems or even DoS attack if an attacker can causes a lot of collisions.

The other requirement is that the hash value of an object must never change. This is logical given that it is used to determine the hash table bucket of the object, if the hash would change it means an object would be in the wrong bucket of the hash table. This is also the reason a mutable dictionary doesn't have a hash and can't be used in sets or as dictionary keys, because there is no way to statify both of these requirements.

So is our current __hash__ implementation broken given the requirements? As far as I can see this doesn't seem to be the case. We don't really support mutating OOI python objects however, at least not changing attributes that are part of the primary key, because the primary_key attribute is only computed once in model_post_init. This does not seem to break any of the Python requirements of __hash__ however. It does gives some surprising results if you do mutate attributes that are used in the primary key:

In [2]: n1 = Network(name="test")

In [3]: n2 = Network(name="test")

In [4]: n3 = Network(name="different")

In [5]: n1 == n2
Out[5]: True

In [6]: n1 == n3
Out[6]: False

In [7]: n2.name = "different"

In [8]: n1 == n2
Out[8]: False

In [9]: n2 == n3
Out[9]: False

In [10]: s = set() ; s.add(n1); s.add(n2) ; s.add(n3)

In [11]: s
Out[11]:
{Network(object_type='Network', scan_profile=None, user_id=None, primary_key='Network|different', name='different'),
 Network(object_type='Network', scan_profile=None, user_id=None, primary_key='Network|test', name='different'),
 Network(object_type='Network', scan_profile=None, user_id=None, primary_key='Network|test', name='test')}

I think in most places we create new python objects so that this doesn't result in problems. It would have been better if those attributes would have been immutable to prevent mistakes, but I don't think it is worth spending time on changing that given that we want to move to XTDB 2.

Note that Django ORM does something similar as we do: it defines __hash__ as the hash of the primary key. The difference is that Django also defines an __eq__ method that considers an object of the same type and with the same primary key as equal.

@originalsouth
Copy link
Contributor Author

@dekkers this isn't an XTDB1/XTDB2 issue. The point is that Octopoes can generate OOI's with the same PK but with other attributes. While we have a way of dealing with that in XTDB (by "adding the fields") in the Octopoes/Python domain these are distinct objects that now have the same hash -- this is wrong, and that is what this issue is about -- and how to deal with it.

@Donnype
Copy link
Contributor

Donnype commented Jan 8, 2025

@originalsouth After looking into this more I concluded the following. In terms of cryptographic hashing, as @dekkers notes Python's builtin __hash__ is not supposed to be one, and in the case for OOIs only provides hashability so that we can do nice set operations. These set operations do not fail because the other part of the check duplicates in a set is __eq__, which is provided by Pydantic in our case and simply checks equality of all the fields I believe (although it's surprisingly hard to find docs on this).

In terms of equality checks and the usage of sets of OOIs, if we wouldn't have the equality check boiled in we would not consider them equal (see below). So as @dekkers's example shows, if we'd have e.g. two instances of "the same" IPPort with the same primary key where one has state = "open" and the other has state = "closed", they would not be equal (in terms of == and set()) although they have the same primary key and hence the same hash. I think this means that this:

OOI's with fields not recorded in the primary key are erroneously deemed to be the same objects, causing Python's built-in hash dependent structures to find collapses.
is actually not the case.

So in my eyes this definition of __hash__ is fine unless we want to differentiate between two exactly the same OOIs using a set, but the problem would be the __eq__ method from Pydantic and not this __hash__. I also would be surprised if are using so many exactly the same objects that we see a performance hit in the nibbles?

And that brings me to what I think should perhaps be discussed here instead of the __hash__ of OOIs: what is the cryptographic hash we want to use on our OOI models? And I think since

>>> {"a": 2, "b": 1} == {"b": 1, "a": 2}
True

we shouldn't use hasher_1 or make sure it is stable over ordering.

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working discussion octopoes Issues related to octopoes
Projects
Status: To be discussed
Development

No branches or pull requests

7 participants