
Add maxmemory-reserved parameter for evicting key earlier to avoid OOM #831

Open · wants to merge 24 commits into base: unstable
Conversation

hwware
Member

@hwware hwware commented Jul 26, 2024

All cases assume a pure caching setup, with no secondary storage on disk.
When the user sets maxmemory in the config file, an OOM (out of memory) error can be returned to the user once Valkey reaches the maxmemory threshold.

The PR Incremental eviction processing redis/redis#7653 solves some of the OOM cases, but in some scenarios OOM still happens. With this PR, we can reduce the probability of OOM or delay it, especially in heavy-write cases.

Example:

We have one VM with 100GB of memory, and the user sets maxmemory to 10GB.

If the Valkey server is under a heavy-write workload with big key-value pairs (200MB each), and we increase maxmemory-eviction-tenacity to adapt to this situation, then before this PR the following happens:
step 0: used memory is 10GB
step 1: a new write command is processed successfully; used memory reaches 10.2GB (no eviction happens)
step 2: a new write command arrives; a 150MB key is evicted, used memory is 10.05GB, and OOM happens
step 3: the user retries the previous write command; another 150MB key is evicted, used memory drops just below 10GB,
and the write command is processed successfully

With this PR, we can set maxmemory-reserved so that the eviction threshold becomes 9GB. Thus, once Valkey reaches 9GB of used memory, the key eviction process begins.

step 0: used memory is 9GB
step 1: a new write command is processed successfully; used memory reaches 9.2GB (no eviction happens)
step 2: a second write command arrives; a 150MB key is evicted, and used memory is 9.05GB;
the write command executes successfully, and used memory rises to 9.25GB
step 3: a third write command arrives; a 150MB key is evicted, and used memory is 9.1GB;
the write command executes successfully, and used memory rises to 9.3GB
...
After roughly 20 such steps, OOM happens.

From the above example, we can see that even though used memory keeps increasing under a heavy-write workload, we can significantly delay the OOM.
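The arithmetic in the example can be sketched as a toy simulation (a minimal sketch; the function name, the sizes, and the one-eviction-per-write model are illustrative assumptions, not Valkey's actual eviction loop):

```c
#include <stddef.h>

/* Toy model of the example above, all sizes in MB. Each step: if used
 * memory is above the eviction threshold, evict one 150MB key, then
 * apply a 200MB write. Returns how many writes succeed before used
 * memory first exceeds maxmemory. Purely illustrative. */
long steps_until_over_hard_max(long used, long threshold, long maxmemory,
                               long write_sz, long evict_sz) {
    long steps = 0;
    while (used <= maxmemory) {
        if (used > threshold) used -= evict_sz; /* early "soft" eviction */
        used += write_sz;                       /* the write itself */
        steps++;
    }
    return steps;
}
```

With the threshold equal to maxmemory (no reserve), the very first oversized write pushes usage past the limit; with a 1GB reserve (threshold 9216MB, maxmemory 10240MB) this simplified model takes 18 steps, the same ballpark as the "roughly 20 steps" estimate above.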

@hwware hwware force-pushed the maxmemory-reserved-parameter branch from e31dd8b to b8a98df Compare July 26, 2024 15:39
@hwware hwware requested review from PingXie, enjoy-binbin and madolson and removed request for PingXie and enjoy-binbin July 26, 2024 15:52

codecov bot commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 83.01887% with 9 lines in your changes missing coverage. Please review.

Project coverage is 71.03%. Comparing base (aced268) to head (4be3b3c).
Report is 9 commits behind head on unstable.

Files with missing lines Patch % Lines
src/config.c 69.23% 4 Missing ⚠️
src/module.c 0.00% 3 Missing ⚠️
src/evict.c 87.50% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #831      +/-   ##
============================================
+ Coverage     70.87%   71.03%   +0.16%     
============================================
  Files           121      121              
  Lines         65203    65282      +79     
============================================
+ Hits          46212    46375     +163     
+ Misses        18991    18907      -84     
Files with missing lines Coverage Δ
src/server.c 87.72% <100.00%> (+0.09%) ⬆️
src/server.h 100.00% <ø> (ø)
src/evict.c 98.10% <87.50%> (-0.76%) ⬇️
src/module.c 9.59% <0.00%> (ø)
src/config.c 78.24% <69.23%> (-0.16%) ⬇️

... and 23 files with indirect coverage changes

@hwware hwware force-pushed the maxmemory-reserved-parameter branch from b8a98df to ba13a0c Compare August 29, 2024 14:50
@hwware hwware force-pushed the maxmemory-reserved-parameter branch from ba13a0c to e27e9de Compare September 9, 2024 16:16
@enjoy-binbin
Member

This seems like an interesting feature. I did not review it; just dropping a comment to approve the concept.

@madolson
Member

I also like the idea. I just haven't really spent enough time thinking about it. Memory management is a big area in Valkey we need to improve.

@hwware hwware force-pushed the maxmemory-reserved-parameter branch from e27e9de to 0a68013 Compare September 16, 2024 17:10
@hwware hwware force-pushed the maxmemory-reserved-parameter branch from 0a68013 to 9596c78 Compare September 24, 2024 01:14
Member

@PingXie PingXie left a comment


Took a quick look.

@PingXie
Member

PingXie commented Sep 25, 2024

+1 on introducing "back pressure" earlier. I feel that this could be used with `maxmemory-eviction-tenacity` to give an even smoother eviction experience.

@hwware hwware force-pushed the maxmemory-reserved-parameter branch 2 times, most recently from 35437eb to db966fe Compare September 26, 2024 02:46
Member

@PingXie PingXie left a comment


Thanks @hwware! LGTM overall. Can you add some tests too?

@hwware
Member Author

hwware commented Oct 2, 2024

Thanks @hwware! LGTM overall. Can you add some tests too?

Sure, will do. Thanks!

@hwware hwware force-pushed the maxmemory-reserved-parameter branch 4 times, most recently from f7cc8c8 to 7a5584b Compare October 7, 2024 09:31
src/evict.c Outdated
@@ -398,11 +398,12 @@ int getMaxmemoryState(size_t *total, size_t *logical, size_t *tofree, float *lev
if (total) *total = mem_reported;

/* We may return ASAP if there is no need to compute the level. */
if (!server.maxmemory) {
if (!server.maxmemory_available) {
Member


Thinking about this more, I feel that it is more preferable to model this new setting as a "soft" maxmemory, which can trigger key eviction earlier but won't cause OOM to be returned before hitting the actual maxmemory. Otherwise, we effectively create an alias of maxmemory. More specifically, I think performEviction should only return EVICT_FAIL when the memory usage goes beyond the real maxmemory.

Additionally, since getMaxmemoryState is also used outside of the OOM prevention path such as in VM_GetUsedMemoryRatio, we should consider parameterizing maxmemory and having it passed in by the caller instead, so that we can maintain the same semantics in these externally facing scenarios.

Thoughts?

Member Author

@hwware hwware Oct 18, 2024


As you suggested, we can treat "maxmemory_available" as a soft maxmemory. Then you might think we should not return OOM; in that case, how can we return a message to the client? Any ideas?

Member


right - how about just processing the command normally?

  1. if used_memory is below soft max, no change to the existing logic
  2. if used_memory is above soft max but below hard max, trigger key eviction and continue with the normal processing
  3. if used_memory is above hard max, no change to the existing logic, i.e., trigger key eviction and fail the command if used_memory is still above hard max after the eviction
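The three cases above can be sketched as a small decision helper (hypothetical names and signature; this is not the actual performEvictions logic, just the proposed policy under the stated assumptions):

```c
#include <stddef.h>

/* Hypothetical sketch of the proposed three-tier policy. soft_max is
 * maxmemory minus the reserved margin; hard_max is the real maxmemory.
 * used_after_evict is what eviction managed to bring usage down to. */
typedef enum { ALLOW, ALLOW_AFTER_EVICT, REJECT_OOM } write_decision;

write_decision decide_write(size_t used, size_t soft_max, size_t hard_max,
                            size_t used_after_evict) {
    if (used <= soft_max) return ALLOW;              /* case 1: no change */
    if (used <= hard_max) return ALLOW_AFTER_EVICT;  /* case 2: evict, then proceed */
    /* case 3: above hard max - evict, and fail only if still above it */
    return used_after_evict > hard_max ? REJECT_OOM : ALLOW_AFTER_EVICT;
}
```

Under this sketch, an OOM error is only possible in case 3, which keeps the new config from becoming an alias of maxmemory.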

@hwware hwware force-pushed the maxmemory-reserved-parameter branch 3 times, most recently from 200b203 to 4da32a2 Compare October 18, 2024 05:54
@hwware hwware force-pushed the maxmemory-reserved-parameter branch from 4da32a2 to 510265f Compare October 28, 2024 13:42
hwware added 16 commits February 6, 2025 19:14
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
@hwware
Member Author

hwware commented Feb 6, 2025

The INFO field mem_not_counted_for_evict should include the reserved memory, right? In our docs, we say that eviction happens when

used_memory - mem_not_counted_for_evict > maxmemory

so we should make sure this formula is still correct. Docs here: https://valkey.io/topics/lru-cache/

No, I do not think so. The mem_not_counted_for_evict value (computed by freeMemoryGetNotCountedMemory()) is the size of the replica output buffers and the AOF buffer, which is excluded from the used-memory count.

However, the reserved memory is the amount set aside from maxmemory. It should not include the AOF and replica output buffer parts.

@hwware hwware force-pushed the maxmemory-reserved-parameter branch from 5638c62 to 4be3b3c Compare February 6, 2025 19:16
@zuiderkwast
Contributor

The INFO field mem_not_counted_for_evict should include the reserved memory, right? In our docs, we say that eviction happens when

used_memory - mem_not_counted_for_evict > maxmemory

so we should make sure this formula is still correct. Docs here: https://valkey.io/topics/lru-cache/

No, I do not think so. The mem_not_counted_for_evict value (computed by freeMemoryGetNotCountedMemory()) is the size of the replica output buffers and the AOF buffer, which is excluded from the used-memory count.

However, the reserved memory is the amount set aside from maxmemory. It should not include the AOF and replica output buffer parts.

So... instead we should update the docs to explain the formula like this?

used_memory - mem_not_counted_for_evict > maxmemory - maxmemory_reserved
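Under this reading, the trigger condition is a one-line predicate (a sketch; eviction_needed is a hypothetical helper, with the field names taken from INFO and from this PR's proposed config):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the updated trigger: eviction starts once dataset memory
 * (used memory minus the buffers not counted for eviction) exceeds
 * maxmemory minus the reserved margin. */
bool eviction_needed(size_t used_memory, size_t mem_not_counted_for_evict,
                     size_t maxmemory, size_t maxmemory_reserved) {
    return used_memory - mem_not_counted_for_evict >
           maxmemory - maxmemory_reserved;
}
```

With maxmemory_reserved set to 0 this reduces to the formula currently in the docs.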

@soloestoy
Member

Since what we actually do is early key/data eviction, how about changing the name to maxmemory-key-eviction or maxmemory-dataset? That would also align with maxmemory-clients.

@zuiderkwast
Contributor

@soloestoy maxmemory-key-eviction, maxmemory-dataset, and your earlier suggestion maxmemory-eviction-threshold are all good names, but don't they all mean the total memory to use, rather than a margin below maxmemory?

@hwware
Member Author

hwware commented Feb 7, 2025

Azure uses the maxmemory-reserved name, which aligns with our goal; see https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-memory-management.

I posted a screenshot here:
[screenshot of the Azure memory-management documentation]

So let's discuss the naming issue in the weekly meeting. Thanks!

Comment on lines 5789 to +5793
"maxmemory:%lld\r\n", server.maxmemory,
"maxmemory_human:%s\r\n", maxmemory_hmem,
"maxmemory_policy:%s\r\n", evict_policy,
"maxmemory_reserved:%lld\r\n", server.maxmemory_reserved,
"maxmemory_reserved_human:%s\r\n", maxmemory_reserved_hmem,
Member

@madolson madolson Feb 7, 2025


Suggested change
"maxmemory:%lld\r\n", server.maxmemory,
"maxmemory_human:%s\r\n", maxmemory_hmem,
"maxmemory_policy:%s\r\n", evict_policy,
"maxmemory_reserved:%lld\r\n", server.maxmemory_reserved,
"maxmemory_reserved_human:%s\r\n", maxmemory_reserved_hmem,
"maxmemory:%lld\r\n", server.maxmemory - server.maxmemory_reserved,
"maxmemory_human:%s\r\n", maxmemory_reserved_hmem,
"maxmemory_policy:%s\r\n", evict_policy,
"maxmemory_for_key_evictions:%lld\r\n", server.maxmemory - server.maxmemory_reserved,
"maxmemory_for_key_evictions_human:%s\r\n", <this needs to be computed I guess>,

I would like to consider this. From a user perspective, they want to see the actual limit; they don't want to do the math themselves.

Member Author


I don't like doing the math either; agreed. I will update it after we finalize the naming for this PR. Thanks.

Contributor


Users still need to do some math to know when eviction happens. mem_not_counted_for_evict is part of the formula.

The math will be easier for users if we add a field like free_memory_below_eviction_threshold defined (if my math is right) like this:

maxmemory - maxmemory_reserved - used_memory + mem_not_counted_for_evict

With a field like this, users can just check this to know how near eviction is.

I remember this doc PR and linked issue: valkey-io/valkey-doc#194. It shows how hard it is for users to understand this math.
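The proposed metric is just that arithmetic (a sketch; the function name mirrors the suggested INFO field and is hypothetical). A signed type matters here, since the value goes negative once usage is already past the threshold:

```c
/* Sketch of the suggested free_memory_below_eviction_threshold field:
 * positive means this much headroom remains before eviction starts,
 * negative means the eviction threshold has already been crossed. */
long long free_below_eviction_threshold(long long maxmemory,
                                        long long maxmemory_reserved,
                                        long long used_memory,
                                        long long mem_not_counted_for_evict) {
    return maxmemory - maxmemory_reserved - used_memory +
           mem_not_counted_for_evict;
}
```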

Member

@PingXie PingXie left a comment


The core logic looks good overall, but I still have a few open comments that need to be addressed.

Looks like we will close on the naming and metrics in the next meeting so I skipped them.

@@ -1279,6 +1279,13 @@ acllog-max-len 128
#
# maxmemory-policy noeviction

# `maxmemory-reserved` specifies a fixed amount of memory set aside from `maxmemory`.
Member


I am good with changing <bytes> to 0 but then we should be explicit in the comments that this config expects "bytes"

Suggested change
# `maxmemory-reserved` specifies a fixed amount of memory set aside from `maxmemory`.
# `maxmemory-reserved` specifies a fixed amount of memory in bytes set aside from `maxmemory`.

serverLog(LL_WARNING, "WARNING: the difference between memory usage and maxmemory is less than reserved memory. "
"This will result in key eviction depending on the maxmemory-policy. But server can still accept new write commands.");
}
startEvictionTimeProc();
Member


I think it is confusing to trigger the key eviction cron job when updating maxmemory-reserved, since we already said this new config is all about "opportunistic" key eviction on the data access path.

@@ -3329,6 +3344,7 @@ standardConfig static_configs[] = {

/* Unsigned Long Long configs */
createULongLongConfig("maxmemory", NULL, MODIFIABLE_CONFIG, 0, ULLONG_MAX, server.maxmemory, 0, MEMORY_CONFIG, NULL, updateMaxmemory),
createULongLongConfig("maxmemory-reserved", NULL, MODIFIABLE_CONFIG, 0, ULLONG_MAX, server.maxmemory_reserved, 0, MEMORY_CONFIG, NULL, updateMaxmemoryReserved),
Member


Have we used negative values for percentages already? I vaguely remember we did, but I'm not 100% sure. Also, have we used the negative-value hack for anything other than percentages?

I am good either way (adding percentage support or not).

Comment on lines +545 to 548
} else {
result = EVICT_OK; /* used_memory greater than key_eviction_memory, but not reach OOM */
}
goto update_metrics;
Member


I think we still want to perform eviction when used_memory is greater than key_eviction_memory.

Suggested change
} else {
result = EVICT_OK; /* used_memory greater than key_eviction_memory, but not reach OOM */
}
goto update_metrics;
goto update_metrics;
}

@@ -697,7 +706,7 @@ int performEvictions(void) {
* across the dbAsyncDelete() call, while the thread can
* release the memory all the time. */
if (server.lazyfree_lazy_eviction) {
if (getMaxmemoryState(NULL, NULL, NULL, NULL) == C_OK) {
if (getMaxmemoryState(NULL, NULL, NULL, NULL, server.key_eviction_memory) == C_OK) {
Member


I see - updating mem_used here essentially moves the goalposts. I think the variable name (mem_used) is the source of my confusion. Agreed with you that we don't need to update mem_used.

@madolson
Member

Decision: We're going to wait until after RC1 to decide what to do about this. It sounds like @hwware had some internal use case where this was important related to storing data on disk. Wen, can you follow up with the specific use case you had in mind for this?

@hwware
Member Author

hwware commented Feb 10, 2025

Decision: We're going to wait until after RC1 to decide what to do about this. It sounds like @hwware had some internal use case where this was important related to storing data on disk. Wen, can you follow up with the specific use case you had in mind for this?

I just updated the top comment and added one specific example.
Also, I misremembered the scenario: we use this parameter internally with a caching system, not in storage cases.

@soloestoy
Member

@hwware I looked at the issue you described, and I think it's because we currently only attempt to perform eviction when executing commands (including read commands). I think we should also check maxmemory and execute startEvictionTimeProc in serverCron, which would also resolve the issue you mentioned. Of course, maxmemory-key-eviction is still necessary.

@hwware
Member Author

hwware commented Feb 11, 2025

startEvictionTimeProc

I do not want to add startEvictionTimeProc() to serverCron, because unnecessary eviction could be triggered.
One example:

Assume we have maxmemory 10GB and maxmemory-key-eviction 9GB.
At some timestamp, Valkey reaches 8.9GB, and a new write command with a 300MB key executes successfully.
Thus the server's used memory is 9.2GB, which is over the maxmemory-key-eviction setting.

If the server then receives no write commands for a long time, theoretically no key eviction happens and used memory stays at 9.2GB.

However, if we add startEvictionTimeProc() to serverCron, the key eviction process could run and some keys would be evicted from the server.

@madolson
Member

And the server has not received the write command for a long time, then theoretically, there is no key eviction and used memory keeps on 9.2GB.

Is staying at 200MB over maxmemory an issue? Basically Redis was designed to have a soft maxmemory limit, in that you can exceed it. You should set maxmemory so that you have some amount of additional buffer on the host. I agree with Zhao that if we think that temporarily breaching that limit is a problem, we should probably consider doing eviction after the write command was executed.

If Valkey server is working on a heavy-write process with key-value pair is big (200MB), and we increase maxmemory-eviction-tenacity to adapt to this situation. Before we apply this PR, the following actions will happen:
step 0: used memory is 10GB
step 1: a new write command processes successfully, used memory reaches 10.2GB (no eviction happens)
step 2: a new write command comes, 150MB key is evicted, and used memory is 10.05GB, OOM happens
step 3: User retries the previous write command, another 150MB key is evicted. And used memory is just below 10GB,
the write command can be processed successfully

I don't think this is how Valkey behaves. In step 2, it will keep evicting keys until it's below 10GB. If we start the background job, we will still accept the command. We only mark the command as OOM if the performEvictions function returns EVICT_FAIL.

@hwware
Member Author

hwware commented Feb 13, 2025

Is staying at 200MB over maxmemory an issue? Basically Redis was designed to have a soft maxmemory limit, in that you can exceed it. You should set maxmemory so that you have some amount of additional buffer on the host. I agree with Zhao that if we think that temporarily breaching that limit is a problem, we should probably consider doing eviction after the write command was executed.

I think we should consider, from both a system perspective and a user perspective, whether being 200MB over maxmemory is acceptable.
Assume one use case: after a write command is executed, we are 200MB over maxmemory. If we then run key eviction, 204,800 keys (at 1KB per key) could be evicted. That means if no more write commands arrive, the user will find 204,800 keys missing from the server.
Do we think this case is reasonable?

If Valkey server is working on a heavy-write process with key-value pair is big (200MB), and we increase maxmemory-eviction-tenacity to adapt to this situation. Before we apply this PR, the following actions will happen:
step 0: used memory is 10GB
step 1: a new write command processes successfully, used memory reaches 10.2GB (no eviction happens)
step 2: a new write command comes, 150MB key is evicted, and used memory is 10.05GB, OOM happens
step 3: User retries the previous write command, another 150MB key is evicted. And used memory is just below 10GB,
the write command can be processed successfully

I don't think this is how Valkey behaves. In step 2, it will keep evicting keys until it's below 10GB. If we start the background job, we will still accept the command. We only mark the command as OOM if the performEvictions function returns EVICT_FAIL.

Yes, I agree; this is currently how Valkey works. And my PR could improve it as well.

For example:
In the first CPU cycle, the server is 200MB over maxmemory. A new write command arrives and key eviction is needed. But unfortunately, EVICT_FAIL is returned during the eviction process, and the user receives the OOM error.
However, if we have maxmemory_key_eviction, then once the server reaches it, even if no suitable key is found during eviction (if (bestkey) is false), we still allow the write command to execute, and no OOM is thrown.

Secondly, my PR is much friendlier to big keys. Assume a 200MB big key arrives, and under the current design only 20MB of keys can be evicted per CPU cycle. The big key would then need to wait for 10 CPU cycles. With my PR, after one CPU cycle the 200MB big key can be written to the server and does not need to wait another 9 CPU cycles.
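The big-key arithmetic above is a simple ceiling division (illustrative numbers; cycles_needed is a hypothetical helper, not code from this PR):

```c
/* How many eviction cycles a write must wait for if bytes_to_free must
 * be reclaimed and at most evict_per_cycle bytes can be freed per cycle. */
int cycles_needed(int bytes_to_free, int evict_per_cycle) {
    /* round up: a partial cycle still counts as a full cycle of waiting */
    return (bytes_to_free + evict_per_cycle - 1) / evict_per_cycle;
}
```

With a 200MB key and 20MB evicted per cycle this gives 10 cycles; under the soft limit the write can proceed after the first cycle instead.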

I am not sure whether the above use cases will convince you. Thanks!

Labels
None yet
Projects
Status: After RC1
7 participants