[WIP] Fix several issues with the hashing method #2049

Closed
wants to merge 2 commits

Conversation

@greschd (Member) commented Oct 11, 2018

Fixes #2008, #2009.

  • dict items are sorted after hashing, fixing the behavior for
    keys which cannot be sorted (see the sketch below)
  • unicode/str and str/bytes are explicitly handled, with a different
    type salt
  • in general, the combination of two hashes is done via the
    method of boost::hash_combine, not by concatenating the hexdigests

TODO: Add migration that deletes all stale hashes.
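For illustration, a minimal sketch of the sort-after-hashing idea — not the PR's actual implementation: `repr()` stands in for the real serialization, and plain digest concatenation stands in for the `_hash_combine` step the PR introduces:

```python
import hashlib

def _digest(data):
    return hashlib.sha224(data).hexdigest()

def hash_dict(dct):
    # Hash each (key, value) pair first, then sort the resulting digest
    # strings. Sorting digests works even when the keys themselves are
    # not mutually orderable, e.g. {1: 'a', 'x': 'b'} under Python 3.
    item_hashes = sorted(
        _digest(repr(key).encode()) + _digest(repr(value).encode())
        for key, value in dct.items()
    )
    return _digest(''.join(item_hashes).encode())

# Insertion order does not affect the result:
assert hash_dict({1: 'a', 'x': 'b'}) == hash_dict({'x': 'b', 1: 'a'})
```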

@greschd (Member, Author) commented Oct 11, 2018

In particular, please check the _hash_combine function carefully, since it is quite essential for getting good hashes.
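For reference, a direct Python transliteration of the 32-bit boost::hash_combine formula looks roughly like this — a sketch of the boost formula itself, not a copy of the PR's `_hash_combine`:

```python
def hash_combine(seed, value_hash):
    # boost::hash_combine (32-bit):
    #   seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2)
    # The mask emulates C++ unsigned 32-bit wraparound.
    return seed ^ ((value_hash + 0x9e3779b9 + (seed << 6) + (seed >> 2)) & 0xFFFFFFFF)

# Unlike a plain XOR, the combination is order-sensitive:
h = 0
for part in (0x1234, 0x5678):
    h = hash_combine(h, part)
```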

@greschd (Member, Author) commented Oct 11, 2018

@giovannipizzi @dev-zero There is indeed a problem with unicode / bytes when going through the database: the keys of get_attrs() turn into unicode only after the node is stored. Do you think this is something we can fix easily, or would it be better to keep the same hash for equal unicode and bytes?
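For context, the type-salt idea from the PR description could look roughly like this (Python 3 semantics; the salt values are made up for illustration) — and it is exactly this distinction that breaks when the database coerces bytes keys to unicode:

```python
import hashlib

def typed_hash(obj):
    # Prefix the payload with a per-type salt so that, e.g., u'x' and b'x'
    # yield different digests. Salt values are hypothetical.
    if isinstance(obj, str):
        salt, payload = b'str:', obj.encode('utf-8')
    elif isinstance(obj, bytes):
        salt, payload = b'bytes:', obj
    else:
        raise TypeError('unsupported type: {}'.format(type(obj)))
    return hashlib.sha224(salt + payload).hexdigest()

assert typed_hash(u'x') != typed_hash(b'x')
```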

@dev-zero (Contributor)

General question: what was the motivation to use SHA-224 instead of SHA-256?

@greschd (Member, Author) commented Oct 11, 2018

I'm not sure, maybe @lekah could know.

@dev-zero (Contributor)

Wrt the hash combine, there is more to the story: boostorg/functional@0471fb7#diff-005c326126e7e14df58f94a00efdbeb5R208
It seems the appropriate procedure depends on the size of the hash, the largest size considered by boost being a uint64_t, while we essentially have a uint256_t. See also HowardHinnant/hash_append#7 for a discussion and https://softwareengineering.stackexchange.com/a/63599/188077 for an explanation of how to obtain the magic constants.
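To make the linked explanation concrete: the magic constant is floor(2**bits / phi), i.e. the fractional bits of the golden ratio. A quick sketch that derives both constants in exact integer arithmetic (math.isqrt needs Python 3.8+):

```python
from math import isqrt

def fibonacci_constant(bits):
    # floor(2**bits / phi); since 1/phi = (sqrt(5) - 1) / 2, this equals
    # floor(2**(bits - 1) * (sqrt(5) - 1)), computed without float rounding.
    return isqrt(5 << (2 * bits - 2)) - (1 << (bits - 1))

assert fibonacci_constant(32) == 0x9e3779b9          # boost's 32-bit constant
assert fibonacci_constant(64) == 0x9e3779b97f4a7c15  # the 64-bit variant
```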

@dev-zero (Contributor)

Ok, I've tracked the boost hash mixing down to here: https://github.com/aappleby/smhasher/wiki/MurmurHash3
There the author proposes to use (for MurmurHash3) the "[...] 128-bit versions for generating unique identifiers for large blocks of data [...]", which is exactly the case we have. Since there is no SHA variant smaller than 224 bits, we could either go with mm3 or BLAKE2b.

@greschd (Member, Author) commented Oct 11, 2018

The problem with these is that neither of them is in hashlib.algorithms_guaranteed, meaning that their availability is platform-dependent. To be honest, I don't really get the point of having these additional parts on top of the XOR.

Another option I found is that hashlib hashes have an update method, where you can just feed in additional bytes. The benefit there is that we don't have to roll our own, but it's not really clear to me how to solve the sorting issue in that case.
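To illustrate the update() option (this is the standard hashlib API, not code from this PR) — the snag is that update() is strictly order-sensitive, which is why the dict-sorting question stays open:

```python
import hashlib

h = hashlib.sha256()
h.update(b'first chunk')
h.update(b'second chunk')

# Incremental hashing is equivalent to hashing the concatenation ...
assert h.digest() == hashlib.sha256(b'first chunksecond chunk').digest()

# ... which also means the feed order matters, so unordered dict items
# would still have to be brought into a canonical order first.
```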

@dev-zero (Contributor)

Ok, I think BLAKE2 is the way to go, especially because of this: https://docs.python.org/3/library/hashlib.html#personalization and, since the hashing seems to be called quite often, being fast is also not bad ;-)
It is actually part of Python 3.6+, for earlier versions there is the backport: https://pypi.org/project/pyblake2/
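For illustration, the personalization feature referenced above works like this in the standard library (the person values here are made up; blake2b allows at most 16 bytes):

```python
import hashlib

# Same data, different personalization strings -> unrelated digests.
h1 = hashlib.blake2b(b'some data', digest_size=32, person=b'aiida-nodes')
h2 = hashlib.blake2b(b'some data', digest_size=32, person=b'aiida-other')
assert h1.digest() != h2.digest()
```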

Wrt using update(): I'm currently investigating BLAKE2's tree hashing (with unlimited fanout, fixed depth and constant offset), which currently seems to be the way to go to continuously update a hash with intermediate hashes, which would allow sorting.
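A sketch of what the unlimited-fanout tree mode might look like with hashlib.blake2b, adapted from the tree-mode pattern in the Python docs; the exact parameter choices (offsets, last_node flags) are one reading of the BLAKE2 spec and would need to be checked:

```python
import hashlib

INNER = 32  # size of the intermediate (leaf) digests fed into the root

def _node(data=b'', *, offset, level, last):
    # fanout=0 selects unlimited fanout; depth=2 means one leaf layer + root.
    return hashlib.blake2b(
        data, digest_size=INNER, fanout=0, depth=2, leaf_size=0,
        inner_size=INNER, node_offset=offset, node_depth=level, last_node=last,
    )

chunks = [b'chunk one', b'chunk two', b'chunk three']
root = _node(offset=0, level=1, last=True)  # the root closes its own layer
for i, chunk in enumerate(chunks):
    leaf = _node(chunk, offset=i, level=0, last=(i == len(chunks) - 1))
    root.update(leaf.digest())
print(root.hexdigest())
```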

@greschd (Member, Author) commented Oct 12, 2018

Ok, using the backport (or, I guess that was the original library) seems like a good solution for Python 2.

As for the tree hashing, do you know if it's required to know the offset of each node within a specific depth level? If so, I guess that would be a bit hard to do, at least if we want to keep this recursive implementation.

@dev-zero (Contributor)

That's what I'm trying to understand. Currently I think that if you have an unlimited fanout, you can leave offset=0 with the depth being 1 for the child nodes (an arbitrarily deep tree can be hashed as a 2-level tree with infinite fanout). I guess we would then hash the keys of a dict, sort the values by the hash of their key and then yield [k1_hash, v1_hash, k2_hash, v2_hash, ...] — see the flat sketch below.
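A flat sketch of that procedure, ignoring the tree-mode parameters for brevity; `repr()`-based serialization is a placeholder:

```python
import hashlib

def _leaf(data):
    return hashlib.blake2b(data, digest_size=32).digest()

def hash_dict(dct):
    # Hash every key and value, sort the pairs by the key hash, and feed
    # [k1_hash, v1_hash, k2_hash, v2_hash, ...] into a single root hash.
    pairs = sorted(
        (_leaf(repr(k).encode()), _leaf(repr(v).encode()))
        for k, v in dct.items()
    )
    root = hashlib.blake2b(digest_size=32)
    for key_hash, value_hash in pairs:
        root.update(key_hash)
        root.update(value_hash)
    return root.hexdigest()

assert hash_dict({1: 'a', 'x': 'b'}) == hash_dict({'x': 'b', 1: 'a'})
```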

@greschd (Member, Author) commented Oct 15, 2018

@dev-zero I don't think I'll have time to look into this more closely before next week, would you mind if I assign these issues to you?

@dev-zero (Contributor)

Sure, no problem.

@greschd (Member, Author) commented Oct 15, 2018

Thanks!

@lekah (Contributor) commented Oct 16, 2018

> General question: what was the motivation to use SHA-224 instead of SHA-256?

Any hashing function is fine, I don't remember why we chose 224.

@dev-zero (Contributor)

obsoleted by #2110

@dev-zero dev-zero closed this Oct 25, 2018
@greschd greschd deleted the fix_python3_hashing branch December 13, 2019 15:36

Successfully merging this pull request may close these issues.

Hashing mixed-type dictionaries broken in Python 3