-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy Avoidance #1457
base: unstable
Are you sure you want to change the base?
Copy Avoidance #1457
Conversation
c2d1a60
to
a0a156c
Compare
https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457 Means you are leaking some memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.
The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.
I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.
I just looked briefly, mostly at the discussions. I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.
So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses. Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode. Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes practice, because smaller strings are stored as EMBSTR. I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads? |
db824f4
to
04e41c1
Compare
Fixed |
ac7e1f5
to
a40e72e
Compare
From the tests and perf profiling it appears that main cause for performance improvement from this feature comes from eliminating expensive memory access to |
Very good questions. I will publish results of various tests with and without I/O threads and with different data sizes on next week. IMPORTANT NOTE: we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to |
Got it. Thanks! At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length. But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense. I appreciate you're testing this with different sizes and with/without IO threading. |
Published results of performance tests in the PR description |
a40e72e
to
cff89de
Compare
cff89de
to
72be14b
Compare
Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more. This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values. We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold. |
Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly. |
@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users are almost never tuning their caches at these levels and it is also very problematic to tell the user to learn his workload and formulate ahis own rules to when to enable this config. I also think there is some risk in introducing this degradation so we should work to understand what other alternatives we have.
|
c->io_last_bufpos = ((clientReplyBlock *)listNodeValue(c->io_last_reply_block))->used; | ||
clientReplyBlock *block = (clientReplyBlock *)listNodeValue(c->io_last_reply_block); | ||
c->io_last_bufpos = block->used; | ||
/* If reply offload enabled force new header */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should indeed check if reply offload is enabled before writing NULL to the header? Checking a global is cheaper than write access to the block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, with or without reply offload, main thread just finished to write to this block
src/networking.c
Outdated
} | ||
|
||
static inline int updatePayloadHeader(payloadHeader *last_header, uint8_t type, size_t len, int slot) { | ||
if (last_header->type == type && last_header->slot == slot) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using for example MGET with small values that are not offloaded and come from different slots, wouldn't this cause multiple plain headers, thus requiring writeV instead of a simple write? We should investigate if this causes any performance degradation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In CMD slot will be always equal to -1 (see if (!clusterSlotStatsEnabled(slot)) slot = -1;
in upsertPayloadHeader
)
In CME MGET keys must map to the same slot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does mean that some plans to support cross slot operations in CME would have to consider this as well.
I agree we can consider this to be solved when the time comes. probably per slot accounting will have to be refactored a bit in order to support cross slot commands.
Just like Ran and Madelyn, I don't want to introduce a config. There are many reasons or that. This is an optimization, not new functionality. Most users don't tweak configs like that. An optimization can change, but a config needs to be maintained and needs backward compatibility. The more configs we add, the harder it is for users to tune the combination of all configs, so we should do our best to find the best behavior automatically. Second topic: There are three code paths worth considering.
Case 3 is a very powerful optimization for long strings. Some users store large data in strings. IO threads disabled is also not a corner case in any way. It is common to run small instances without IO threads and to scale horizontally using more cluster nodes instead of vertically using threads. Instead of configs, I think we should pick some safe constants for N and M to make sure we don't get a regression in any case. I suggest N = 9 and M = 16K. |
@zuiderkwast We discussed it and going to propose in this PR to perform reply offload according to static number of I/O threads (i.e. io-threads config). Static number of I/O threads is more preferable than active I/O threads because it makes all the tests and troubleshooting of potential issues much more deterministic. I am working on final code changes and tests. The reply offload activation according to size of object is deferred to another PR. The best solution will be to perform reply offload according to actual main thread load matching the condition where reply offload is beneficial. However, it requires much more complex research. |
|
Thank you @alexander-shabanov . I think we need to be careful about this PR as this is near the release date. I think it is good we will focus on introducing a simple optimization and later focus on better optimization for dynamic support or large strings when io-threads are disabled. |
@alexander-shabanov / @uriyage following the discussion with the maintainers let's try to answer some of the followup questions:
|
please these replies: |
6bf48aa
to
cb5e97f
Compare
@zuiderkwast @ranshid I uploaded new revision addressing main comment regarding reply offload efficiency starting certain number of threads. Updated description with changes in the code + config and performance test numbers |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1457 +/- ##
============================================
- Coverage 71.02% 70.93% -0.09%
============================================
Files 121 121
Lines 65249 65447 +198
============================================
+ Hits 46344 46428 +84
- Misses 18905 19019 +114
|
cb5e97f
to
2293a6a
Compare
Future Enhancements The reply offload optimization in this PR is enabled according to (static) number of I/O threads with no regard to object size cause it is expected to benefit any bulk string reply starting certain number of threads. As performance tests show, the reply offload can benefit large objects (e.g. 64 KB) starting much lower number of threads. In order to use this benefit following challenges should be addressed:
The 1st challenge can be addressed by using flag on serverObject. The blocker for this approach is that right now all 64 bits on serverObject are already in use. Alternative can be next observation - if number of I/O threads lower than The 2nd challenge can be addressed by using So, potential solutions for these challenge should be researched, tested and tuned. We are pretty sure these solutions will leverage reply construction and write-to-client mechanisms introduced in this PR. |
src/networking.c
Outdated
} | ||
|
||
static inline int updatePayloadHeader(payloadHeader *last_header, uint8_t type, size_t len, int slot) { | ||
if (last_header->type == type && last_header->slot == slot) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does mean that some plans to support cross slot operations in CME would have to consider this as well.
I agree we can consider this to be solved when the time comes. probably per slot accounting will have to be refactored a bit in order to support cross slot commands.
return C_ERR; | ||
} | ||
|
||
static size_t upsertPayloadHeader(char *buf, size_t *bufpos, payloadHeader **last_header, uint8_t type, size_t len, int slot, size_t available) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is upser? I would have thought a typo so but it is constantly used all over the code and comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also can we document what this internal function is responsible of doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upsert is known terminology for update or Insert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I now see upsert (need glasses)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update-or-insert, I had to look it up actually. It wasn't in my vocabulary until now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ranshid upsertPayloadHeader body is commented thoroughly . Let me know if any specific comment required
2293a6a
to
b897fc1
Compare
b897fc1
to
eba682b
Compare
e5342fc
to
b9173ee
Compare
b9173ee
to
2cfbc3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review. I can't continue right now so I'm posting my pending comments. I've reviewed to the middle of networking.c to just before the iovec definitions.
I'm convinced this feature is worth the complexity.
src/networking.c
Outdated
return _addReplyPayloadToBuffer(c, s, len, CLIENT_REPLY_PLAIN); | ||
} | ||
|
||
static size_t _addBulkOffloadToBuffer(client *c, const void *payload, size_t len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offload or CopyAvoid?
Let's think about the terminology for a bit. Maybe we can find some terms that are even more explicit about what the buffer contains.
We have three type of buffer content and we should ideally use different words for each of them:
- unencoded buffer with PLAIN content, no payload headers
- encoded buffer, payload header type PLAIN, plain content
- encoded buffer, payload header type COPY-AVOID, content is a sequence of
bulkCopyAvoid
objects.
The overloaded use of the word PLAIN is a source of confusion here. I suggest we use PLAIN for unencoded buffer and another word such as INLINE for the payload header type for encoded buffer with plain content.
The bulkCopyAvoid
struct is actually a reference to the object. That's why we increment the reference counter. 😀 So maybe we should use the word "reference" for this kind of buffer content, e.g. _addBulkRefToBuffer
, and the payload header type can be CLIENT_REPLY_REF? Or use the word INDIRECT?
Please also add a short comment above each of these functions to explain what the function name itself doesn't explain, like what it means that if it returns 0, etc.
And one more thing. I think this function can take const bulkCopyAvoid *
instead of void *payload
and size_t len
. Where this function calls _addReplyPayloadToBuffer
we can spell out sizeof(bulkCopyAvoid)
for the length argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Streamlined terminology in the code according to the suggestion
src/networking.c
Outdated
@@ -1127,8 +1277,18 @@ void addReplyBulkLen(client *c, robj *obj) { | |||
_addReplyLongLongWithPrefix(c, len, '$'); | |||
} | |||
|
|||
int tryOffloadBulkReply(client *c, robj *obj) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used only in this file? Make it static. (Shows the intent that it's a local function and it let's the compiler warn about unused functions.)
Rename "Offload"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added static and renamed
/* Adds the reply to the reply linked list. | ||
* Note: some edits to this function need to be relayed to AddReplyFromClient. */ | ||
void _addReplyProtoToList(client *c, list *reply_list, const char *s, size_t len) { | ||
static void _addReplyPayloadToList(client *c, list *reply_list, const char *payload, size_t len, uint8_t payload_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment above belongs to _addReplyProtoToList
I think...?
Please move it and write a more correct comment about _addReplyPayloadToList
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment belongs to original function that became to be _addReplyPayloadToList
. Updated the comment to be more relevant
src/networking.c
Outdated
@@ -484,6 +621,18 @@ void _addReplyToBufferOrList(client *c, const char *s, size_t len) { | |||
if (len > reply_len) _addReplyProtoToList(c, c->reply, s + reply_len, len - reply_len); | |||
} | |||
|
|||
void _addBulkOffloadToBufferOrList(client *c, robj *obj) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename?
Add a short one-line comment as documentation.
Make it static. It's what we've decided to use for local functions. (Keep the underscore prefix if you want. I don't care. It's used in this file but not in other files.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed and added comment
Signed-off-by: Alexander Shabanov <[email protected]>
2cfbc3b
to
96571b5
Compare
* Each chunk contains header followed by payload */ | ||
typedef struct __attribute__((__packed__)) payloadHeader { | ||
size_t len; /* payload length in a reply buffer */ | ||
size_t actual_len; /* actual reply length for non-plain payloads */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider explicitly state in the comment that this represents the bulk string length plus the RESP protocol overhead and maybe rename to actual_reply_len
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wanted to make it more generic for future copy avoidance use cases if any. what do you think? should we make it now very specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can always change it in the future.
Signed-off-by: Alexander Shabanov <[email protected]>
96571b5
to
73db90f
Compare
Overview
This PR introduces the ability to avoid copying whole content of string object into replies (i.e. bulk string replies) and to allow I/O threads refer directly to obj->ptr in writev iov as described at #1353.
Key Changes
network-bytes-out
for copy avoid repliesmin-io-threads-copy-avoid
introduced to manage this number of threadsmin-io-threads-value-prefetch-off
introduced to manage this number of threadsNote: When copy avoidance disabled content and handling of client reply buffers remains as before this PR
Implementation Details
client
andclientReplyBlock
structs:buf_encoded
flag has been added toclientReplyBlock
struct and toclient
struct for staticc->buf
to indicate if reply buffer is in copy avoidance mode (i.e. include headers and payloads) or not (i.e. plain replies only).io_last_written_buf
,io_last_written_bufpos
,io_last_written_data_len
fields addedclient
struct to to keep track of write state betweenwritevToClient
invocationsReply construction:
_addReplyToBuffer
and_addReplyProtoToList
have been renamed to_addReplyPayloadToBuffer
and_addReplyPayloadToList
and extended to support different types of payloads - regular replies and copy avoid replies._addReplyToBuffer
and_addReplyProtoToList
calls now_addReplyPayloadToBuffer
and_addReplyPayloadToList
and used for adding regular replies to client reply buffers._addBulkOffloadToBuffer
and_addBulkOffloadToList
are used for adding copy avoid replies to client reply buffers.Write-to-client infrastructure:
The
writevToClient
and_postWriteToClient
has been significantly changed to support copy avoidance capability.Internal configuration:
min-io-threads-copy-avoid
- Minimum number of IO threads for copy avoidancemin-io-threads-value-prefetch-off
- Minimum number of IO threads for disabling value prefetchmin-string-size-copy-avoid
- Minimum bulk string size for copy avoidance when IO threads disabledmin-string-size-copy-avoid-threaded
- Minimum bulk string size for copy avoidance when IO threads enabledTesting
--io-threads
flagPerformance Tests
Note: pay attention
io-threads 1
config means only main thread with no additional io-threads,io-threads 2
means main thread plus 1 I/O thread,io-threads 9
means main thread plus 8 I/O threads.512 byte object size
Tests are conducted on memory optimized instances using:
64 KB object size
Tests are conducted on network optimized instances using: