
defrag: eliminate persistent kvstore pointer and edge case fixes #1430

Merged — 1 commit merged into valkey-io:unstable on Dec 12, 2024

Conversation

JimB123 (Contributor)

@JimB123 JimB123 commented Dec 12, 2024

This update addresses several issues in defrag:

  1. In the defrag redesign (Refactor of ActiveDefrag to reduce latencies #1242), a bug was introduced where `server.cronloops` was no longer being incremented in `whileBlockedCron()`. This resulted in some memory statistics not being updated while blocked.
  2. In the test case for AOF loading, we were seeing errors due to defrag latencies. However, running the math, the latencies are justified given the extremely high CPU target of the test case. Adjusted the expected latency check to allow longer latencies for this case, where defrag is undergoing starvation while AOF loading is in progress.
  3. A "stage" is passed a "target". For the main dictionary and expires, we were passing in a kvstore*. However, on flushall or swapdb, the pointer may change. It's safer and more stable to use an index for the DB (a DBID). Then if the pointer changes, we can detect the change, and simply abort the stage. (If there's still fragmentation to deal with, we'll pick it up again on the next cycle.)
  4. We always start a new stage on a new defrag cycle. This gives the new stage time to run, and prevents latency issues for certain stages which don't operate incrementally. However, often several stages will require almost no work, and this will leave a chunk of our CPU allotment unused. This is mainly an issue in starvation situations (like AOF loading or LUA script) - where defrag is running infrequently, with a large duty-cycle. This change allows a new stage to be initiated if we still have a standard duty-cycle remaining. (This can happen during starvation situations where the planned duty cycle is larger than the standard cycle. Most likely this isn't a concern for real scenarios, but it was observed in testing.)
  5. Minor comment correction in `server.h`.

codecov bot commented Dec 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.90%. Comparing base (0c8ad5c) to head (005deba).
Report is 5 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1430      +/-   ##
============================================
+ Coverage     70.76%   70.90%   +0.14%     
============================================
  Files           119      119              
  Lines         64618    64642      +24     
============================================
+ Hits          45728    45836     +108     
+ Misses        18890    18806      -84     
| Files with missing lines | Coverage Δ |
|---|---|
| src/defrag.c | 89.72% <100.00%> (-0.96%) ⬇️ |
| src/server.c | 87.44% <100.00%> (+0.01%) ⬆️ |
| src/server.h | 100.00% <ø> (ø) |

... and 14 files with indirect coverage changes

@JimB123 JimB123 marked this pull request as draft December 12, 2024 19:46
@JimB123 JimB123 changed the title defrag: eliminate persistent pointers defrag: eliminate persistent pointers edge case fixes Dec 12, 2024
@JimB123 JimB123 changed the title defrag: eliminate persistent pointers edge case fixes defrag: eliminate persistent pointers and edge case fixes Dec 12, 2024
@JimB123 JimB123 changed the title defrag: eliminate persistent pointers and edge case fixes defrag: eliminate persistent kvstore pointer and edge case fixes Dec 12, 2024
@JimB123 JimB123 marked this pull request as ready for review December 12, 2024 20:35
@hpatro (Collaborator) left a comment

The change looks good to me. But if I understand correctly, there isn't enough test coverage to discover such issues. Should we add more tests with defrag to avoid issues like this in the future?

Comment on lines +1232 to +1234:

```c
int dutyCycleUs = computeDefragCycleUs();
monotime endtime = starttime + dutyCycleUs;
bool haveMoreWork = true;
```
Collaborator
I think we were avoiding camel-cased variable names in functions, but the clang-format check doesn't seem to complain.

@hpatro (Collaborator)

hpatro commented Dec 12, 2024

Talked to @JimB123; he mentioned that this change (#1303) should help us discover some of the above issues. Currently, the testing with defrag is minuscule.

@hpatro hpatro requested a review from madolson December 12, 2024 22:55
@hpatro hpatro merged commit 32f2c73 into valkey-io:unstable Dec 12, 2024
48 checks passed
vudiep411 pushed a commit to Autxmaton/valkey that referenced this pull request Dec 15, 2024
(commit message duplicates the PR description above)

Signed-off-by: Jim Brunner <[email protected]>
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025