Shouldn't hash_combine use different constants for 32/64 bit size_t? #7
This code came from http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf and is used only for comparison purposes with that proposal.
Hmm, and is there a way for a non-committee member, i.e. for just anyone, to submit such a remark on N3876? Because it really is 32-bit based. The paper on which Boost has based it, also mentioned in N3876, is http://www.cs.rmit.edu.au/~jz/fulltext/jasist-tch.pdf
The constant itself should come from the TEA algorithm; if we follow the process of its derivation, as described here https://softwareengineering.stackexchange.com/a/63599/188077, we must come up with a different constant for each hash length, as follows:
16-bit: 0x9e37
32-bit: 0x9e3779b9
64-bit: 0x9e3779b97f4a7c15
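(A side note of my own, not part of the original comment: the derivation in that answer boils down to scaling the reciprocal golden ratio to the word size, i.e. the n-bit constant is floor(2^n / phi). A minimal sketch that reproduces the 32-bit value; the 64-bit one exceeds double precision, so it is usually written out as a literal.)

#include <cmath>
#include <cstdio>

int main()
{
    const double phi = (1.0 + std::sqrt(5.0)) / 2.0;   // golden ratio, ~1.6180339887
    // 32-bit constant: floor(2^32 / phi) == 0x9E3779B9
    const unsigned long c32 =
        static_cast<unsigned long>(std::floor(4294967296.0 / phi));
    std::printf("%08lX\n", c32);                        // prints 9E3779B9
    // 64-bit constant: floor(2^64 / phi) == 0x9E3779B97F4A7C15.
    // A double cannot represent 2^64 / phi to the last bit, so the value
    // is normally spelled out as a literal, as in the snippets below.
    std::printf("%016llX\n", 0x9e3779b97f4a7c15ULL);
    return 0;
}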
This: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0814r2.pdf is the latest.
Thank you, Howard. I have looked at the paper, and the "proposed wording" looks quite generic and does not suggest any particular implementation, so there's actually no impact on the standard, and the reference example is still an example and not a final implementation. So it looks like it would be more proper to file this with Boost.
I'm just researching a little on how to create a simple hash_combine for my project, and I came across the hash_combine mentioned here
in multiple places, and thus also this issue. I'm interested mostly in a version where size_t is 64-bit, and I'm grateful for the "magic number" for 64-bit hashes posted above. However, what is not clear is what should be done with the bit-shifts. Can anyone here judge whether the following would be a reasonable hash combiner for 64-bit?
Should the shifts be different? For my application it doesn't have to be perfect and I'd prefer a simple and fast solution, but I'd like to avoid obvious mistakes.
Since 64-bit is 2x of 32-bit, you can try increasing the shifts 2x as well, i.e. (seed << 12) and (seed >> 4).
To make it even more portable, maybe the following can be done:

#include <cstdint>
#include <functional>

template <typename T>
inline void hash_combine (std::uint16_t& seed, const T& val)
{
    seed ^= std::hash<T>{}(val) + 0x9e37U + (seed<<3) + (seed>>1);
}

template <typename T>
inline void hash_combine (std::uint32_t& seed, const T& val)
{
    seed ^= std::hash<T>{}(val) + 0x9e3779b9U + (seed<<6) + (seed>>2);
}

template <typename T>
inline void hash_combine (std::uint64_t& seed, const T& val)
{
    seed ^= std::hash<T>{}(val) + 0x9e3779b97f4a7c15LLU + (seed<<12) + (seed>>4);
}

The proper overload will be invoked automatically, depending on the actual size of std::size_t.
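(A usage sketch of my own, not from the comment above: passing a std::size_t seed picks whichever overload matches the platform's size_t. This does assume std::size_t is a typedef of one of those exact fixed-width types; on ABIs where std::size_t is, say, unsigned long while std::uint64_t is unsigned long long, no overload matches and this fails to compile.)

#include <cstddef>
#include <string>

struct Key { std::string name; int id; };

inline std::size_t hash_key(const Key& k)
{
    std::size_t seed = 0;
    hash_combine(seed, k.name);  // std::uint64_t overload on a typical 64-bit target
    hash_combine(seed, k.id);    // (std::uint32_t overload on a 32-bit target)
    return seed;
}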
Awesome, thanks for your reply. Makes sense to me. It might not be the best possible, but considering we were using a simple xor at some point to combine hashes, this is a step up. I ended up using:

#include <cstddef>
#include <functional>
#include <type_traits>

// to work around static_assert(false, ...)
template <class T>
struct dependent_false : std::false_type {};

template <class T>
inline void hash_combine(std::size_t& seed, const T& value) {
  // Simple hash_combine, see e.g. here:
  // https://github.com/HowardHinnant/hash_append/issues/7
  // Not sure we ever need 32bit, but here it is...
  if constexpr (sizeof(std::size_t) == 4) {
    seed ^= std::hash<T>{}(value) + 0x9e3779b9U + (seed << 6) + (seed >> 2);
  } else if constexpr (sizeof(std::size_t) == 8) {
    seed ^= std::hash<T>{}(value) + 0x9e3779b97f4a7c15LLU + (seed << 12) +
            (seed >> 4);
  } else {
    static_assert(dependent_false<T>::value, "hash_combine not implemented");
  }
}
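(For completeness, a usage sketch of my own, not from the comment above: plugging that hash_combine into a std::hash specialization so a user-defined type can be used as an unordered_map key. The Person type is just an illustration.)

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct Person {
    std::string name;
    int age;
    bool operator==(const Person& o) const { return name == o.name && age == o.age; }
};

namespace std {
template <>
struct hash<Person> {
    std::size_t operator()(const Person& p) const {
        std::size_t seed = 0;
        ::hash_combine(seed, p.name);  // the template defined above, at global scope
        ::hash_combine(seed, p.age);
        return seed;
    }
};
}  // namespace std

// std::unordered_map<Person, int> ages;  // Person now works as a key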
@NikolausDemmel
unsigned is clear, but what happened apparently is that
Is there any update on the fix for this issue?
@JVApen This is not a bug, just a suggestion. If you find it useful, you can take it.
That's correct.
Although @HowardHinnant doesn't recommend it, personally I still find this approach useful.
Instead of writing code immediately, wouldn't it be better to first come up with a clear, mathematical description of what a hash function should accomplish?

At some point I was in need of a hash_combine for pointers. Note that a pointer is unique almost by definition (unless it points to dynamically allocated memory and you free that allocation again), and it is often used as such, i.e. pointers to global instances are used to assure uniqueness. Therefore there is no need to hash a pointer: it is its own hash. However, when combining two pointers a problem quickly arises from the fact that (assuming pointers to memory allocated from the heap) the distance between the pointers is relatively small (the high bits are equal) and aligned (the low bits are all zero). All entropy is concentrated in a small portion of the available bits (assume 64 bits). If the entropy of those bits is not spread out over the full range, then the hash combine is bad, because that will increase the chance of collisions.

What I did was look for collisions between any pair of two pointers from a large collection of pointers to "allocated" data; not really allocated - just all spaced 4 bytes apart with a different offset typical for malloc implementations (i.e., near zero, near 2^32, near 2^64, and somewhere in between). A set of 10,000 such "allocations" already gives rise to 100,000,000 pairs; by collapsing - say - every two bits into one (i.e., treat 11 == 00 and 01 == 10, and other combinations) it is easy to do brute-force searches. What I did instead was look at the "distance" between two hash values in terms of the number of differing bits. That is, consider two hash values h1 and h2; their distance is pop_count(h1 ^ h2). If h1 and h2 are totally uncorrelated then that distance behaves the same as if h1 and h2 were pure random numbers. Note that here h1 and h2 are both combined hashes from two "pointers". The result that I ended up with gave these statistics: even using 10,000 pointers in a concentrated region of memory and then considering all possible 100,000,000 pairs, all those pairs had a distance distribution that one would expect for pure random 64-bit values.

This is just an example of such reasoning (and testing) though. I don't think that my hash_combine is suitable as a general hash_combine function (because it outputs zero when you input zero twice; although that could easily be fixed, it DOES show that care has to be taken; it is also not a problem for my case, because it really is only intended for pointers to memory allocations, not nullptr).
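(Below is a rough sketch of my own of the kind of check described above, not the poster's actual test harness: combine pairs of simulated, closely spaced, 16-byte-aligned "pointers" and look at pop_count(h1 ^ h2) between the resulting hashes. For pure random 64-bit values the distance averages ~32 bits with a standard deviation of ~4. The combiner used here is just the 64-bit variant from earlier in the thread, as a stand-in.)

#include <bit>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// The combiner under test (the 64-bit variant discussed above), as a stand-in.
static std::uint64_t combine(std::uint64_t seed, std::uint64_t v)
{
    return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 12) + (seed >> 4));
}

int main()
{
    // Simulated "allocations": closely spaced, 16-byte aligned, equal high bits.
    std::vector<std::uint64_t> ptrs;
    for (std::uint64_t i = 0; i < 2000; ++i)
        ptrs.push_back(0x00007f0000000000ULL + i * 16);

    // Combine consecutive pairs of "pointers" into hashes.
    std::vector<std::uint64_t> hashes;
    for (std::size_t i = 0; i + 1 < ptrs.size(); i += 2)
        hashes.push_back(combine(combine(0, ptrs[i]), ptrs[i + 1]));

    // Bit distance between every pair of hashes; ~32 on average is what
    // pure random 64-bit values would give.
    double sum = 0.0, n = 0.0;
    for (std::size_t i = 0; i < hashes.size(); ++i)
        for (std::size_t j = i + 1; j < hashes.size(); ++j) {
            sum += std::popcount(hashes[i] ^ hashes[j]);
            ++n;
        }
    std::printf("mean bit distance over %.0f pairs: %.2f (ideal ~32)\n", n, sum / n);
    return 0;
}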
This is wrong. Pointers tend to be multiples of 8, and hash tables often have a 2^n size, so using bare pointers as keys wastes 7 out of every 8 hash entries, causing an average of, say, 4 collisions when a hash table tries to maintain an alloc/size ratio of 2.
Pointers are unique - so there is no reason to apply a hash algorithm to them. While two (different) pointers never collide (they are unequal), their hashes MIGHT collide - so you only worsen the situation. If two pointers are equal then any hash will also be equal, of course. A hash (i.e. std::hash) is also 8 bytes. Hashing a pointer is just nonsense. I have no idea what you think would be colliding. If pointers are different, they are different. There is no collision without hashing. Only hashes collide.
So the entire point of hash_append is to decouple the hashing algorithm from the types being hashed, and, even more importantly, to let a single hash be computed over multiple values of multiple types. This is accomplished by recognizing that hashing algorithms are composed of 3 stages:
1. Initialize the state.
2. Update (append) input bytes into the state.
3. Finalize the state into the hash value.
Steps 1 and 3 are always done once per hash. But step 2 can be done many times in the computation of a single hash. Thus step 2 can be used to input multiple types, with multiple values, into a single hash.
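(A bare-bones sketch of my own of that three-stage shape, not the actual hash_append API from this repo, using FNV-1a as the example algorithm: the state is initialized once, update() may be called any number of times with the bytes of any number of values, and finalize() is called once.)

#include <cstddef>
#include <cstdint>

struct fnv1a
{
    std::uint64_t state = 14695981039346656037ULL;            // 1. initialize

    void update(const void* key, std::size_t len)              // 2. update (N times)
    {
        const unsigned char* p = static_cast<const unsigned char*>(key);
        for (std::size_t i = 0; i < len; ++i)
            state = (state ^ p[i]) * 1099511628211ULL;
    }

    std::uint64_t finalize() const { return state; }            // 3. finalize (once)
};

// Feeding two values of different types into one hash:
inline std::uint64_t hash_two(int a, double b)
{
    fnv1a h;
    h.update(&a, sizeof a);
    h.update(&b, sizeof b);
    return h.finalize();
}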
First, hash values are clipped by the size of the hash table, so uniqueness of the pointer does not give any guarantee about collisions. Most hash implementations use a hash table with a power-of-2 size, because computing the modulo is much quicker (it's a bit mask), and since collision management implies having a table significantly bigger than the number of entries, a power of 2 fits the need with no drawbacks. This makes the choice of identity as the hash function for pointers and integers very bad. As pointers are most of the time multiples of 8, if not 16, due to malloc being conservative about memory alignment, using identity as the hash function for pointers will cause 7 (or 15) out of every 8 (or 16) hash entries to be wasted, and will often cause an average of 4 (or 8) collisions for every entry. For integers, identity is also not good, because there is a good chance that the non-random distribution of integers modulo 2^N will cause systematic collisions. You can only reasonably use identity as a hash if your hash table size is a large prime number, as a recurrence over a large prime is as improbable as the prime value.
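(A tiny illustration of my own of the point above: with a power-of-two table the bucket index is just the low bits of the hash, so identity-hashed 16-byte-aligned pointers can only ever land in 1 out of every 16 buckets.)

#include <cstdint>
#include <cstdio>

int main()
{
    const std::uint64_t mask = 1024 - 1;                        // table of 2^10 buckets
    for (std::uint64_t p = 0x7f0000001000ULL; p < 0x7f0000001100ULL; p += 16)
        std::printf("%llu\n", (unsigned long long)(p & mask));  // only multiples of 16
    return 0;
}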
Well, that's a good point, I didn't think about it from that point of view. So yes, we had better apply a hash function to a pointer too. This can be done by casting it to an integer first.
Ok, if you need a hash (of 64 bits) that is suited for clipping to any (smaller) number of bits, then what you need is to spread the entropy of the input over all bits of the hash, but with an emphasis on the less significant bits, because the less significant a bit is, the larger the chance that it will be used. Since pointers have very low entropy in the three or four least significant bits, a good start would be to rotate them by 4 bits. Nevertheless, this wasn't my point. The point of my first post was that it is the nature of the input and how the output is used (clipping, thus) that dictates the requirements for a hash function; only once you understand all that in mathematical terms can you design a hash function. On the topic of pointers, here is my implementation to hash_combine them:
PS: This assumes that a pointer is 64-bit.
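(The implementation referred to above is not reproduced in this capture of the thread. Purely as an illustration of the "rotate the alignment bits away first" idea, and explicitly not the poster's code, something along these lines could be written.)

#include <bit>
#include <cstdint>

// Illustrative only: rotate the (near-constant) low alignment bits out of the
// least significant positions before mixing, then combine with the 64-bit
// constant discussed earlier in the thread.
inline std::uint64_t hash_combine_ptr(std::uint64_t seed, const void* p)
{
    const std::uint64_t v = std::rotr(reinterpret_cast<std::uint64_t>(p), 4);
    return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 12) + (seed >> 4));
}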
As described by the paper
But the idea of
Just a heads up for anyone who comes to this thread looking for the 128-bit constant.
@trbabb Thank you for providing the 128-bit constant. My constant was specifically for 64-bit.
No it wasn't :P. See #7 (comment) - which is equal to 0x9e3779b97f4a7c15f39cc0605cedc834 for the first 100 bits too.
Current code seems to be taken from Boost, but it is likely 32-bit oriented. Shouldn't a different constant be used for 64-bit size_t?
hash_append/n3876.h, line 34 (at commit bd892bf)