New hash table #1186
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@             Coverage Diff              @@
##           unstable    #1186      +/-   ##
============================================
+ Coverage     70.85%   70.90%    +0.04%
============================================
  Files           118      119        +1
  Lines         63612    64652     +1040
============================================
+ Hits          45075    45839      +764
- Misses        18537    18813      +276
```
For command lookup, since the command table is rarely updated, why not build an efficient trie? Or would it be even slower?
Just a partial review (everything but hashset.c, which I guess is the interesting stuff) and tests. I've blocked off time tomorrow to read through :) Exciting stuff!
@xingbowang For the command table, yes, a perfect hash function would be even better. The main point of this implementation is to use it for the database keys though, which will come in another PR. (A trie is possibly slower because there might be more memory accesses.)
excellent work!
To be honest, I didn't like open-addressing hash tables before, because the tombstones generated by deletions have a very negative impact on both performance and space. I felt they were not suitable for online services with a large number of deletion operations.
However, this design is quite clever: deleting an element does not create a tombstone. Only a bucket that has been full at some point is considered to have a tombstone, which greatly reduces the impact of deletion operations.
A bucket's length is 7, so in terms of chained hashing, a tombstone effect would only occur when there are 7 collisions, which intuitively seems to be a relatively rare situation.
In theory, this should save memory. Compared to the previous chained hashing, it saves two pointers (the previous and next dictEntry) and adds one hash value, saving about 15 bytes per element (two 8-byte pointers minus a one-byte hash). However, the frequency of resizing might be higher, as resizing is triggered not only by the fill factor but also by the proportion of tombstones. Do we have reliable tests for space and performance?
@soloestoy I'm glad you like the design. In my WIP branch for using this for keys, I get 10-15% better performance for GET compared to unstable. I set the fill factor to 77% currently, and with 7/8 of the space used for pointers (1/8 in each bucket is metadata bits), it means that 67% of the allocation is pointers. Rehashing triggered by the proportion of tombstones is a corner case. I didn't implement it at first, but I realized in the defrag tests that the test code tried to create a lot of fragmentation: it added many keys, and eviction deleted many of them. In this case, the table became full of tombstones. It does not happen very often and I think it will not affect the performance. There is a potential problem with CoW during fork though. The dict can avoid rehashing until it reaches a 500% fill factor. This is not possible with open addressing; we have to rehash at some point. I set the hard limit to 90% currently. I hope it will not be a major problem. In the "resize avoid" mode, resizing is allowed but incremental rehashing is paused, so only new keys are added to the new table. This also avoids destroying the CoW.
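For readers following along, here is a minimal sketch of a cache-line-sized bucket consistent with the numbers above (1/8 metadata, 7/8 entry pointers). The field names are illustrative assumptions, not the PR's code; the authoritative layout is in the PR's hashset.c (later hashtable.c).

```c
#include <stdint.h>

/* Illustrative sketch only, not the PR's actual definitions. */
#define ENTRIES_PER_BUCKET 7

typedef struct bucket {
    uint8_t presence;   /* bit i set => entries[i] is in use; one spare bit
                         * can mark an "everfull" (tombstoned) bucket */
    uint8_t hashes[ENTRIES_PER_BUCKET]; /* one byte of each entry's hash, to
                                         * reject slots without dereferencing */
    void *entries[ENTRIES_PER_BUCKET];  /* user-defined key-value entries */
} bucket; /* 1 + 7 + 7*8 = 64 bytes: exactly one cache line on 64-bit */
```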
Like 75% through a review of hashset.c, and my brain is done with reviewing. Overall it looks really solid though.
The one thing that occurred to me while I was reviewing is that instead of using the everfull bit to start probing, we could use it to indicate that the last position is a pointer to another bucket and do linked-list resolution. This does a couple of things:
- Prevents significant chain lengths, since collisions won't cause issues.
- Keeps the overall scanning implementation much more similar to the current implementation.
- We get most of the performance benefit from the cache lines, and the next cursor call is likely almost as random as a random DRAM lookup. (Which we'll probably prefetch anyways?)
For what it's worth, I can make nextCursor() use sequential, in-memory order for rehashing, scanning, and the iterator if we use the most significant bits instead of the least significant bits of the hash for the bucket index. The math for iterating the two different-sized tables in parallel for rehash and scan would be a little different, but not bad - basically bit shifting instead of masking. At any rate, I plan to add prefetch hints - not sure how much that would eliminate the benefit of sequential memory access.
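For context on the reverse-bit cursor being discussed, here is a sketch of the classic reverse-binary increment that dict's scan uses (and which this table inherits while the bucket index comes from the least significant hash bits); rev() and nextCursor() are written out for illustration:

```c
#include <stdint.h>

/* Reverse the bits of a 64-bit word (as in the dictScan algorithm). */
static uint64_t rev(uint64_t v) {
    uint64_t s = 64, mask = ~0ULL;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

/* Advance a scan cursor by incrementing its reversed bits. This visits
 * every bucket index and stays valid across table resizes, because a
 * bucket in a table of size 2^n maps to a predictable group of buckets
 * in a table of size 2^(n+k). */
static uint64_t nextCursor(uint64_t cursor, uint64_t mask) {
    cursor |= ~mask;     /* set the bits above the table's index bits */
    cursor = rev(cursor);
    cursor++;
    cursor = rev(cursor);
    return cursor;
}
```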
@madolson This is a great idea. It reminds me of "bulk chaining" in the SegCache paper. I think it can be done without changing the API, so it can be done at any point. I believe it has many benefits, like we can postpone rehashing almost indefinitely like the dict does when there's a fork. We can also prevent scan from returning duplicates (when it wraps around zero following a probe chain).
@SoftlyRaining This sounds interesting. Memory is fetched per cache line, so I'm wondering if there are benefits to accessing it sequentially compared to the current reverse-bit increment. Systems with larger cache lines, sure, but that's only MacBook so far? Does predictive memory fetching assume that cache lines are accessed sequentially?
@zuiderkwast AFAIK you're right about cache lines. The benefit would be making the access pattern predictable so the hardware prefetcher can understand it and prefetch buckets efficiently. My guess is that it'd perform almost as well as hand-tuned prefetch hints.
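A minimal sketch of the kind of hand-tuned prefetch hint being compared against, using the GCC/Clang __builtin_prefetch builtin. The bucket type and the bucketNext/slotInUse/keysMatch helpers are the illustrative assumptions from the sketch above, not the PR's code:

```c
#include <stdbool.h>

/* Hypothetical helpers, assumed from the earlier bucket sketch: */
static bucket *bucketNext(bucket *b);                /* child bucket or NULL */
static bool slotInUse(bucket *b, int i);             /* presence bit check */
static bool keysMatch(void *entry, const void *key); /* full key compare */

/* Sketch: issue a prefetch for the next bucket before scanning the
 * current one, so the DRAM fetch overlaps the key comparisons. */
static void *lookupInChain(bucket *b, const void *key, uint8_t hashbyte) {
    while (b != NULL) {
        bucket *next = bucketNext(b);
        if (next != NULL)
            __builtin_prefetch(next, 0, 1); /* read, low temporal locality */
        for (int i = 0; i < ENTRIES_PER_BUCKET; i++) {
            if (slotInUse(b, i) && b->hashes[i] == hashbyte &&
                keysMatch(b->entries[i], key))
                return b->entries[i];
        }
        b = next;
    }
    return NULL;
}
```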
Hello. I've changed the design based on Madelyn's idea of bucket chaining. It's no longer open addressing.
Furthermore,
@touitou-dan Hello! Can you help with a prefetching implementation for this hash table?
Hey @zuiderkwast , I'd be happy to help! However, I'd first like the opportunity to hear about the new implementation. Would you be available for a chat?
I've only reviewed the hashset.[hc] code. I'll continue reviewing the remaining files later today.
@zuiderkwast finished the review, and I don't have more comments apart from the ones I sent this morning. Very nice design and implementation!
I really like that we plan to combine key/value/metadata into a single reference counted "element" and that the main dict just becomes a set of reference counted elements. Using an independent reference counted element makes it much easier to perform internal asynchronous operations.
I mostly reviewed the API definition. My main comments are:
- I don't think we should be differentiating between a "key" and an "element". An element IS A key.
- I don't think we should ever be exposing pointers into the structure. Instead, an opaque hashsetPosition can have appropriate accessor methods.
- Keep the opaque stuff opaque. Don't let stack allocation be a reason to expose internals.
Force-pushed from ed3126a to 9df0601
I did not review the changes carefully, but here you mentioned you have 10-15% better performance for GET. I think the point is used memory. If it is, could you share the QPS performance? I am not sure how the changes affect the QPS. Thanks
@hwware Thanks for looking. I did a new benchmark just now. My simple benchmark setup: Unstable:
This branch:
My laptop is Linux on a MacBook Pro M2. It is unusual, yes, but others got similar results. @SoftlyRaining @uriyage you have done more benchmarks on more normal server hardware, with/without IO threads, and with larger value sizes. Please share. The explanation of the improved speed is fewer memory lookups, especially to memory that is not already in L1 cache. This is a very slow operation, so counting memory accesses is a good way to optimize.
So the total is 2 memory accesses in most cases. For dict with the key embedded in the dictEntry (current unstable) we have a minimum of 3 memory accesses, with more if there are hash collisions.
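To spell out the counts above, here is the assumed breakdown (each dereference of a cold cache line counted as one memory access; the slot rejection via in-bucket hash bits is taken from the design described earlier in the thread):

```c
/* New hashtable, common case (2 accesses):
 *   1. the bucket: metadata + entry pointers in one cache line; the
 *      per-slot hash bits reject most non-matching slots right here
 *   2. the entry itself: the object with embedded key (and value)
 *
 * dict, minimum (3 accesses):
 *   1. the bucket array slot: a pointer to a dictEntry
 *   2. the dictEntry: embedded key, value pointer, next pointer
 *   3. the value object; plus one more access per chained collision
 */
```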
I'm through everything now; nothing is standing out as a major concern to me right now. I think it should be good to merge ASAP.
```c
        return 0;
    }

    if (ht->type->resizeAllowed) {
```
Why does the resizeAllowed callback override the hard limit check? I would expect this to be inside the conditional.
The idea here is that we want to force a resize if the fill factor is above the hard limit, i.e. it's too full and we need to expand. We want to ignore resizeAllowed in this case. Additionally, the `double` fill factor is only computed when needed.
For context, these are the lines below this one:

```c
if (ht->type->resizeAllowed) {
    double fill_factor = (double)min_capacity / ((double)numBuckets(old_exp) * ENTRIES_PER_BUCKET);
    if (fill_factor * 100 < MAX_FILL_PERCENT_HARD && !ht->type->resizeAllowed(alloc_size, fill_factor)) {
        /* Resize callback says no. */
        return 0;
    }
}
```

Maybe you think we also need to forbid the user from explicitly shrinking to a very small size, which would take us above the max fill factor? resize is an internal function, and the only way a user can try to shrink the table is with hashtableShrinkIfNeeded, so a resize can't get you above the hard max fill factor.
src/hashtable.c (outdated)

```c
 * It's allowed to insert and replace entries. Deleting entries is only allowed
 * for the entry that was just returned by hashtableNext. Deleting other
 * entries is not supported. (It can cause internal fragmentation.)
```
It seems like we are trying to handle the internal fragmentation, so is this actually a constraint?
Not sure what you mean. The code that handles internal fragmentation in iterator and scan only works for the bucket chain that was just iterated/scanned. If the user deletes from another bucket chain while iterating or from the scan callback, it will leave holes that we don't handle.
```c
     * bucket chain, before we move on to the next index. */
    if (iter->hashtable->pause_rehash == 1 &&
        iter->hashtable->used[iter->table] < iter->last_seen_size) {
        compactBucketChain(iter->hashtable, iter->index, iter->table);
```
More of a followup: maybe we should make the safe iterator have the same guarantees as SCAN, in that you have to process everything in the bucket. Defrag should be fine, since it can queue up everything into defrag later; active expire should be able to cover everything. I don't think there should be any other iteration where items can change in between invocations. If we do that, then we never need any of this bucket compaction code.
I'm not sure how we can do that. You mean that we shall allow the fillBucketHole function to run on delete also when iterating? It reorders the entries in the bucket chain. How do we make sure all the entries in the chain are returned exactly once?
Force-pushed from d6da54f to dfbea9d
I didn't review everything, but overall LGTM; it is a great job.
```diff
@@ -276,9 +276,9 @@ void restoreCommand(client *c) {
     }

     /* Create the key and set the TTL if any */
-    dbAdd(c->db, key, obj);
+    dbAdd(c->db, key, &obj);
```
Shouldn't we use setKey here with TTL as a flag to avoid reallocation?
setKey doesn't take a TTL either. I want to improve this in a follow-up to provide something like dbAddWithExpire and setKeyWithExpire to avoid reallocation. It needs some refactoring and I don't want this PR to become larger than what it already is.
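Purely as an illustration of the follow-up idea, a hypothetical shape for those APIs; the function names come from the comment above, but the signatures and types are guesses, not anything in this PR:

```c
/* Hypothetical follow-up API sketches (not in this PR). The serverDb and
 * robj names are assumed from the codebase; signatures are invented. */
typedef struct serverDb serverDb;  /* assumed opaque here */
typedef struct serverObject robj;  /* assumed opaque here */

void dbAddWithExpire(serverDb *db, robj *key, robj **valref, long long expire_ms);
void setKeyWithExpire(serverDb *db, robj *key, robj **valref, long long expire_ms);
```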
I think all my concerns are addressed, so let's get it merged :)
Finally finished reviewing all of it! I think it's looking pretty good - my comments are just a few small things I don't think are critical to address. :)
A cache-line aware hash table with a user-defined key-value entry type, supporting incremental rehashing, scan, iterator, random sampling, incremental lookup and more... Signed-off-by: Viktor Söderqvist <[email protected]>
This changes the type of command tables from dict to hashtable. Command table lookup takes ~3% of overall CPU time in benchmarks, so it is a good candidate for optimization. My initial SET benchmark comparison suggests that hashtable is about 4.5 times faster than dict and this replacement reduced overall CPU time by 2.79% 🥳 --------- Signed-off-by: Rain Valentine <[email protected]> Signed-off-by: Rain Valentine <[email protected]> Signed-off-by: Viktor Söderqvist <[email protected]> Co-authored-by: Rain Valentine <[email protected]>
Instead of a dictEntry with pointers to key and value, the hashtable has a pointer directly to the value (robj), which can hold an embedded key and acts as a key-value entry in the hashtable. This minimizes the number of pointers to follow and thus the number of memory accesses to look up a key-value pair.

```
Keys hashtable       robj
  +-------+    +-----------------------+
  |   0   |    | type, encoding, LRU   |
  |   1 ------>| refcount, expire      |
  |   2   |    | ptr                   |
  |  ...  |    | optional embedded key |
  +-------+    | optional embedded val |
               +-----------------------+
```

The expire timestamp (TTL) is also stored in the robj, if any. The expire hash table points to the same robj.

Overview of changes:

* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
  * When there's unused space, reserve an expire field to avoid reallocating it later if an expire is added.
  * Always reserve space for expire for large key names to avoid realloc if it's set later.
* Update db functions (db.c)
  * dbAdd, setKey and setExpire reallocate the object when embedding a key
  * setKey does not increment the reference counter, since it would require duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys are now embedded in the objects, so all objects in the database need to be unique. Thus, we can't use shared objects as values. Also delete test cases for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
  * Improvement: Don't access the expires table before defrag has actually reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc would happen, and some other adjustments in test cases.
* Adjust memory prefetch for the new hash table implementation in IO-threading, using the new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: object free is done in the main thread, while keeping obj->ptr offloading in the IO thread, since the DB object is now allocated by the main thread and not by the IO thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires table when possible.

---------

Signed-off-by: Uri Yagelnik <[email protected]>
Signed-off-by: uriyage <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Uri Yagelnik <[email protected]>
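To make the diagram concrete, here is a hypothetical sketch of such an entry. The first five fields match the well-known robj header; the trailing part is a simplified assumption about how the embedded data is laid out (the PR's actual robj uses flag bits to record which optional fields are present):

```c
/* Hypothetical sketch only; not the PR's actual struct definition. */
struct sketchObject {
    unsigned type : 4;       /* OBJ_STRING, OBJ_LIST, ... */
    unsigned encoding : 4;   /* OBJ_ENCODING_EMBSTR, ... */
    unsigned lru : 24;       /* LRU clock or LFU data */
    int refcount;
    void *ptr;               /* the value, unless embedded below */
    /* Optional trailing fields, present depending on flags:
     *   long long expire;   TTL timestamp, shared with the expires table
     *   char embkey[];      embedded key (length-prefixed string)
     *   char embval[];      embedded string value */
};
```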
Merged. If anyone was going to post comments, please post them anyway. I can handle them as follow-ups.
The new `hashtable` provides faster lookups and uses less memory than `dict`. A TCL test case "SRANDMEMBER with a dict containing long chain" is deleted because it's covered by a hashtable unit test "test_random_entry_with_long_chain", which is already present. This change also moves some logic from dismissMemory (object.c) to zmadvise_dontneed (zmalloc.c), so the hashtable implementation which needs the dismiss functionality doesn't need to depend on object.c and server.h. This PR follows #1186. --------- Signed-off-by: Rain Valentine <[email protected]> Signed-off-by: Viktor Söderqvist <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>
With verbose logging, this message is logged every 5 seconds. Before #1186, there were 128 slots in a database with 100 keys. Slots here means the number of entries that can be stored on the top level of a hash table, which in the dict was the same as the number of buckets. With the new hash table, there are multiple entries per bucket, so the number of buckets is not a useful metric to show. This PR multiplies the number of buckets by the number of entry positions per bucket (so 16 buckets with 7 entry positions each show up as 112 slots). With this change, the log message is instead 36236:M 09 Feb 2025 15:47:18.923 - DB 9: 100 keys (0 volatile) in 112 slots HT. Signed-off-by: Viktor Söderqvist <[email protected]>
This PR implements a new hash table and uses it for keys, command lookup and more. There are multiple commits. Do not squash-merge.
Hash table design
The hash table is cache-line optimized and implemented as outlined in #169, but changed to chaining instead of probing after an idea by Madelyn. The key-value entry is user-defined, which allows the user to embed key and value within a single allocation. The hash table supports incremental rehashing, scan, random key, etc. just like the dict, but faster and using less memory. Each bucket contains a few bits of metadata per entry. For details, see the comments in src/hashtable.{c,h}.
If a bucket is full, the last entry pointer in the bucket can be replaced by a child-bucket pointer, and we get a bucket chain.
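A sketch of that chaining rule, reusing the illustrative 7-entry bucket from earlier in the thread (not the PR's actual types):

```c
/* When a bucket's "chained" metadata bit is set, its last slot holds a
 * pointer to a child bucket instead of a key-value entry (sketch). */
union lastSlot {
    void *entry;          /* normal bucket: a key-value entry */
    struct bucket *child; /* chained bucket: next link in the chain */
};
```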
Command lookup
The 2nd commit replaces dict with hashtable for command lookup. This was implemented by @SoftlyRaining.
Keys and expire
The 3rd commit replaces dict with the new hash table in kvstore.c and all code that uses it, such as db.c.
The hashtable entry in this case is the robj struct. The key and optionally an expire timestamp are embedded in the robj struct, i.e. the key is embedded in the value. Therefore, we can call this a valkey object: val + key. This design saves roughly 20 bytes per key for short string keys.
Some db.c functions like dbAdd, setKey and setExpire now reallocate the value object to embed the key and optional expire in it. setKey does not increment the reference counter, since it would require duplicating the object.
Fixes #991
Fixes #992