[WIP] Fix several issues with the hashing method #2049

Closed
wants to merge 2 commits

Conversation

@greschd (Member) commented Oct 11, 2018

Fixes #2008, #2009.

  • dict items are sorted after hashing, fixing the behavior for
    keys which cannot be sorted (see the sketch below)
  • unicode/str and str/bytes are explicitly handled, with a different
    type salt
  • in general, the combination of two hashes is done via the
    method of boost::hash_combine, not by concatenating the hexdigests

TODO: Add migration that deletes all stale hashes.
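For illustration, a minimal sketch of the sort-after-hashing idea — not the PR's actual implementation: `repr()` stands in for the real serialization, and plain digest concatenation stands in for the `_hash_combine` step the PR introduces:

```python
import hashlib

def _digest(data):
    return hashlib.sha224(data).hexdigest()

def hash_dict(dct):
    # Hash each (key, value) pair first, then sort the resulting digest
    # strings. Sorting digests works even when the keys themselves are
    # not mutually orderable, e.g. {1: 'a', 'x': 'b'} under Python 3.
    item_hashes = sorted(
        _digest(repr(key).encode()) + _digest(repr(value).encode())
        for key, value in dct.items()
    )
    return _digest(''.join(item_hashes).encode())

# Insertion order does not affect the result:
assert hash_dict({1: 'a', 'x': 'b'}) == hash_dict({'x': 'b', 1: 'a'})
```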

@greschd (Member, Author) commented Oct 11, 2018

In particular, please check the _hash_combine function carefully, since it is quite essential for getting good hashes.
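For reference, a direct Python transliteration of the 32-bit boost::hash_combine formula looks roughly like this — a sketch of the boost formula itself, not a copy of the PR's `_hash_combine`:

```python
def hash_combine(seed, value_hash):
    # boost::hash_combine (32-bit):
    #   seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2)
    # The mask emulates C++ unsigned 32-bit wraparound.
    return seed ^ ((value_hash + 0x9e3779b9 + (seed << 6) + (seed >> 2)) & 0xFFFFFFFF)

# Unlike a plain XOR, the combination is order-sensitive:
h = 0
for part in (0x1234, 0x5678):
    h = hash_combine(h, part)
```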

@greschd (Member, Author) commented Oct 11, 2018

@giovannipizzi @dev-zero There is indeed a problem with unicode / bytes when going through the database: the keys of get_attrs() turn into unicode only after the node is stored. Do you think this is something we can fix easily, or would it be better to keep the same hash for equal unicode and bytes?
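For context, the type-salt idea from the PR description could look roughly like this (Python 3 semantics; the salt values are made up for illustration) — and it is exactly this distinction that breaks when the database coerces bytes keys to unicode:

```python
import hashlib

def typed_hash(obj):
    # Prefix the payload with a per-type salt so that, e.g., u'x' and b'x'
    # yield different digests. Salt values are hypothetical.
    if isinstance(obj, str):
        salt, payload = b'str:', obj.encode('utf-8')
    elif isinstance(obj, bytes):
        salt, payload = b'bytes:', obj
    else:
        raise TypeError('unsupported type: {}'.format(type(obj)))
    return hashlib.sha224(salt + payload).hexdigest()

assert typed_hash(u'x') != typed_hash(b'x')
```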

@dev-zero (Contributor)

General question: what was the motivation to use SHA-224 instead of SHA-256?

@greschd (Member, Author) commented Oct 11, 2018

I'm not sure, maybe @lekah could know.

@dev-zero (Contributor)

Wrt the hash combine, there is more to the story: boostorg/functional@0471fb7#diff-005c326126e7e14df58f94a00efdbeb5R208
It seems the appropriate procedure depends on the size of the hash, the largest size considered by boost being a uint64_t, while we essentially have a uint256_t. See also HowardHinnant/hash_append#7 for a discussion and https://softwareengineering.stackexchange.com/a/63599/188077 for an explanation of how to obtain the magic constants.
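To make the linked explanation concrete: the magic constant is floor(2**bits / phi), i.e. the fractional bits of the golden ratio. A quick sketch that derives both constants in exact integer arithmetic (math.isqrt needs Python 3.8+):

```python
from math import isqrt

def fibonacci_constant(bits):
    # floor(2**bits / phi); since 1/phi = (sqrt(5) - 1) / 2, this equals
    # floor(2**(bits - 1) * (sqrt(5) - 1)), computed without float rounding.
    return isqrt(5 << (2 * bits - 2)) - (1 << (bits - 1))

assert fibonacci_constant(32) == 0x9e3779b9          # boost's 32-bit constant
assert fibonacci_constant(64) == 0x9e3779b97f4a7c15  # the 64-bit variant
```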

@dev-zero (Contributor)

Ok, I've tracked the boost hash mixing down to here: https://github.com/aappleby/smhasher/wiki/MurmurHash3
There the author proposes to use (for MurmurHash3) the "[...] 128-bit versions for generating unique identifiers for large blocks of data [...]", which is exactly the case we have. Since there is no SHA variant smaller than 224 bits, we could either go with mm3 or BLAKE2b.

@greschd (Member, Author) commented Oct 11, 2018

The problem with these is that neither of them is in hashlib.algorithms_guaranteed, meaning that their availability is platform-dependent. To be honest, I don't really get the point of having these additional parts on top of the XOR.

Another option I found is that hashlib hashes have an update method, where you can just feed in additional bytes. The benefit there is that we don't have to roll our own, but it's not really clear to me how to solve the sorting issue in that case.
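To illustrate the update() option (this is the standard hashlib API, not code from this PR) — the snag is that update() is strictly order-sensitive, which is why the dict-sorting question stays open:

```python
import hashlib

h = hashlib.sha256()
h.update(b'first chunk')
h.update(b'second chunk')

# Incremental hashing is equivalent to hashing the concatenation ...
assert h.digest() == hashlib.sha256(b'first chunksecond chunk').digest()

# ... which also means the feed order matters, so unordered dict items
# would still have to be brought into a canonical order first.
```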

@dev-zero (Contributor)

Ok, I think BLAKE2 is the way to go, especially because of this: https://docs.python.org/3/library/hashlib.html#personalization and, since the hashing seems to be called quite often, being fast is also not bad ;-)
It is actually part of Python 3.6+, for earlier versions there is the backport: https://pypi.org/project/pyblake2/
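For illustration, the personalization feature referenced above works like this in the standard library (the person values here are made up; blake2b allows at most 16 bytes):

```python
import hashlib

# Same data, different personalization strings -> unrelated digests.
h1 = hashlib.blake2b(b'some data', digest_size=32, person=b'aiida-nodes')
h2 = hashlib.blake2b(b'some data', digest_size=32, person=b'aiida-other')
assert h1.digest() != h2.digest()
```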

Wrt using update(): I'm currently investigating BLAKE2's tree hashing (with unlimited fanout, fixed depth and constant offset), which currently seems to be the way to go to continuously update a hash with intermediate hashes, which would allow sorting.
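A sketch of what the unlimited-fanout tree mode might look like with hashlib.blake2b, adapted from the tree-mode pattern in the Python docs; the exact parameter choices (offsets, last_node flags) are one reading of the BLAKE2 spec and would need to be checked:

```python
import hashlib

INNER = 32  # size of the intermediate (leaf) digests fed into the root

def _node(data=b'', *, offset, level, last):
    # fanout=0 selects unlimited fanout; depth=2 means one leaf layer + root.
    return hashlib.blake2b(
        data, digest_size=INNER, fanout=0, depth=2, leaf_size=0,
        inner_size=INNER, node_offset=offset, node_depth=level, last_node=last,
    )

chunks = [b'chunk one', b'chunk two', b'chunk three']
root = _node(offset=0, level=1, last=True)  # the root closes its own layer
for i, chunk in enumerate(chunks):
    leaf = _node(chunk, offset=i, level=0, last=(i == len(chunks) - 1))
    root.update(leaf.digest())
print(root.hexdigest())
```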

@greschd (Member, Author) commented Oct 12, 2018

Ok, using the backport (or, I guess that was the original library) seems like a good solution for Python 2.

As for the tree hashing, do you know if it's required to know the offset of each node within a specific depth level? If so, I guess that would be a bit hard to do, at least if we want to keep this recursive implementation.

@dev-zero (Contributor)

That's what I'm trying to understand. Currently I think that if you have an unlimited fanout, you can leave offset=0 with the depth being 1 for the child nodes (an arbitrarily deep tree can be hashed as a 2-level tree with infinite fanout). I guess we would then hash the keys of a dict, sort the values by the hash of their key and then yield [k1_hash, v1_hash, k2_hash, v2_hash, ...] — see the flat sketch below.
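A flat sketch of that procedure, ignoring the tree-mode parameters for brevity; `repr()`-based serialization is a placeholder:

```python
import hashlib

def _leaf(data):
    return hashlib.blake2b(data, digest_size=32).digest()

def hash_dict(dct):
    # Hash every key and value, sort the pairs by the key hash, and feed
    # [k1_hash, v1_hash, k2_hash, v2_hash, ...] into a single root hash.
    pairs = sorted(
        (_leaf(repr(k).encode()), _leaf(repr(v).encode()))
        for k, v in dct.items()
    )
    root = hashlib.blake2b(digest_size=32)
    for key_hash, value_hash in pairs:
        root.update(key_hash)
        root.update(value_hash)
    return root.hexdigest()

assert hash_dict({1: 'a', 'x': 'b'}) == hash_dict({'x': 'b', 1: 'a'})
```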

@greschd (Member, Author) commented Oct 15, 2018

@dev-zero I don't think I'll have time to look into this more closely before next week, would you mind if I assign these issues to you?

@dev-zero (Contributor)

Sure, no problem.

@greschd (Member, Author) commented Oct 15, 2018

Thanks!

@lekah (Contributor) commented Oct 16, 2018

> General question: what was the motivation to use SHA-224 instead of SHA-256?

Any hashing function is fine, I don't remember why we chose 224.

@dev-zero (Contributor)

obsoleted by #2110

@dev-zero dev-zero closed this Oct 25, 2018
@greschd greschd deleted the fix_python3_hashing branch December 13, 2019 15:36

Successfully merging this pull request may close these issues.

Hashing mixed-type dictionaries broken in Python 3