
Slot migration improvement #245

Closed · wants to merge 33 commits

Conversation

PingXie (Member) commented Apr 6, 2024

Overview

This PR significantly enhances the reliability and automation of the Valkey cluster re-sharding process, specifically during slot migrations in the face of primary failures. These updates address critical failure issues that previously required extensive manual intervention and could lead to data loss or inconsistent cluster states.

Enhancements

Automatic Failover Support in Empty Shards

The cluster now supports automatic failover in shards that do not own any slots, which is common during scaling operations. This improvement ensures high availability and resilience from the outset of shard expansion.

Replication of Slot Migration States

All CLUSTER SETSLOT commands are now replicated to the replica nodes before being executed on the primary. This keeps the slot migration state consistent within the shard and prevents state loss if the primary fails. A new TIMEOUT parameter specifies how many milliseconds to wait for this replication to complete; the default is 2 seconds.

CLUSTER SETSLOT slot { IMPORTING node-id | MIGRATING node-id | NODE node-id | STABLE } [ TIMEOUT timeout ]
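
For example, an operator finalizing slot ownership could wait up to 5 seconds for the replicas to acknowledge the state change (the slot number and node-id below are placeholders):

    CLUSTER SETSLOT 609 NODE <target-node-id> TIMEOUT 5000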

Recovery of Logical Migration Links

The update automatically repairs the logical links between source and target nodes during failovers. This ensures that requests are correctly redirected to the new primary in the target shard after a primary failure, maintaining cluster integrity.

Enhanced Support for New Replicas

New replicas added to shards involved in slot migrations will now automatically inherit the slot's migration state as part of their initialization. This ensures that new replicas are immediately consistent with the rest of the shard.

Improved Logging for Slot Migrations

Additional logging has been implemented to provide operators with clearer insights into the slot migration processes and automatic recovery actions, aiding in monitoring and troubleshooting.

Additional Changes

cluster-allow-replica-migration

When cluster-allow-replica-migration is disabled, a primary that loses its last slot to another shard no longer automatically becomes a replica of the receiving shard. Instead, it remains in its own shard, which is now empty, with no slots assigned to it.
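
For reference, the relevant valkey.conf directive, shown here with the non-default value this behavior change applies to:

    cluster-allow-replica-migration no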

Fix #21

codecov bot commented Apr 20, 2024

Codecov Report

Attention: Patch coverage is 82.82443% with 45 lines in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (93f8a19) to head (6f459da).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #245      +/-   ##
============================================
+ Coverage     68.43%   68.89%   +0.45%     
============================================
  Files           109      109              
  Lines         61681    61785     +104     
============================================
+ Hits          42214    42566     +352     
+ Misses        19467    19219     -248     
Files Coverage Δ
src/commands.def 100.00% <ø> (ø)
src/debug.c 53.47% <100.00%> (+0.65%) ⬆️
src/networking.c 85.08% <100.00%> (ø)
src/rdb.c 75.79% <100.00%> (-0.47%) ⬇️
src/replication.c 86.33% <100.00%> (+0.03%) ⬆️
src/server.c 88.60% <100.00%> (+0.47%) ⬆️
src/blocked.c 91.80% <91.66%> (-0.06%) ⬇️
src/cluster_legacy.c 83.22% <81.11%> (+8.53%) ⬆️

... and 10 files with indirect coverage changes

PingXie marked this pull request as ready for review April 22, 2024 17:24
PingXie (Member, Author) commented Apr 22, 2024

@valkey-io/core-team ready for your review

zuiderkwast (Contributor) commented:

@PingXie I updated the PR description. Please check if it's correct and edit again if it's not.

zuiderkwast (Contributor) left a comment:

Partial review. (I've already reviewed it earlier and approved it, but that was long ago.)

src/valkey-cli.c (outdated, resolved)
src/valkey-cli.c (outdated, resolved)
valkey.conf (resolved)
PingXie (Member, Author) commented Apr 23, 2024

I think my previous fix (redis/redis#13055) introduced a race condition. More specifically, I don't think it is a good idea to move the slots on L3015; instead, I should have always routed the topology update through clusterUpdateSlotsConfigWith. As a result, the logic that attempts to fix the import source during a source-primary failover is bypassed if the old primary's (now a replica) PONG message races ahead of the new primary's (the old replica).

[screenshot]

> @PingXie I updated the PR description. Please check if it's correct and edit again if it's not.

Thanks @zuiderkwast! I will write the proper PR description next.

src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
zuiderkwast (Contributor) commented:

Two commits need sign-off. They're called "Update src/valkey-cli.c".

PingXie (Member, Author) commented Apr 28, 2024

> I think my previous fix (redis/redis#13055) introduced a race condition. More specifically, I don't think it is a good idea to move the slots on L3015; instead, I should have always routed the topology update through clusterUpdateSlotsConfigWith. As a result, the logic that attempts to fix the import source during a source-primary failover is bypassed if the old primary's (now a replica) PONG message races ahead of the new primary's (the old replica).

> [screenshot]

This should now be fixed. I don't think we need to backport the change to Valkey 7.2. The original fix should still work. It is just that the new reliability improvements in this PR require that all topology updates go through clusterUpdateSlotsConfigWith.

The next step is to (re)introduce the server-initiated wait and get rid of the REPLICAONLY flag.
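
For illustration only, a minimal sketch of the idea that gossip-driven slot claims should be funneled through clusterUpdateSlotsConfigWith rather than being applied inline, so the import-source fixup logic cannot be bypassed. The helper name and the exact checks are hypothetical (assuming the cluster_legacy.c context), not the actual patch:

    /* Hypothetical helper, for illustration only: route a sender's slot claim
     * through clusterUpdateSlotsConfigWith() instead of reassigning slots
     * inline, so that ownership changes, epoch conflicts, and migration-state
     * fixups are all handled in one place. */
    static void handleSenderSlotClaim(clusterNode *sender, clusterMsg *hdr) {
        uint64_t senderEpoch = ntohu64(hdr->configEpoch);
        int slotsChanged =
            memcmp(sender->slots, hdr->myslots, sizeof(hdr->myslots)) != 0;

        if (slotsChanged || sender->configEpoch < senderEpoch) {
            /* Centralized topology-update path. */
            clusterUpdateSlotsConfigWith(sender, senderEpoch, hdr->myslots);
        }
    }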

hpatro (Contributor) left a comment:

Mostly nit-picks. Pretty difficult to wrap my head around these complex conditions around slot ownership. Will re-read the code a few times :D.

  • We should apply the formatting (planned to be added) to the new code changes. I see parenthesis placement and spacing after commas differing from what's expected in a few places.
  • We are using primary/replica in certain place(s) and master/slave terminology in a few place(s) in code comments. Maybe just use primary/replica.

src/blocked.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
.gitignore (resolved)
src/cluster_legacy.c (resolved)
src/replication.c (resolved)
src/cluster_legacy.c (outdated, resolved)
PingXie (Member, Author) commented Apr 30, 2024

Thanks for the review, @hpatro. There is definitely inconsistency in this PR as it has gone through too many rounds of refactoring. I will clean up the naming and comments as part of the timeout enhancement, but I will leave the coding style and terminology (primary/replica) to separate PRs, which I plan to address across the entire codebase.

src/cluster_legacy.c (outdated, resolved)
PingXie (Member, Author) commented Apr 30, 2024

Quick poll on the new timeout parameter.

We talked about allowing the client to control the wait time (with a default of 5s) on a per-command basis for CLUSTER SETSLOT, which now pre-replicates the command to the replicas first. This new parameter can only be added to the end of the command to maintain backward compatibility. I think a cleaner change would be introducing a "timeout" token, such as "timeout 1000", but this does not look like a pattern used in existing commands like WAIT, where the timeout parameter is expected at a fixed position. Thoughts?

hpatro (Contributor) commented Apr 30, 2024

For newer commands, e.g. https://valkey.io/commands/xread/, we went with what you're suggesting above, i.e. the keyword first, followed by the actual value. The parsing logic becomes slightly more complex but gives more flexibility. I think it's reasonable to proceed with your suggestion.
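
For illustration, a minimal sketch of parsing an optional trailing TIMEOUT token for CLUSTER SETSLOT, assuming the keyword-then-value form discussed above and the 2-second default from the PR description. The helper name is hypothetical; this is not the actual implementation:

    /* Illustrative sketch only: parse an optional trailing
     * "TIMEOUT <milliseconds>" pair, keeping the fixed-position
     * arguments backward compatible. Assumes the usual server.h helpers. */
    static int parseSetSlotTimeout(client *c, int first_optional_arg, mstime_t *timeout_ms) {
        *timeout_ms = 2000; /* default: 2 seconds */
        if (c->argc == first_optional_arg) return C_OK; /* no optional args */
        if (c->argc == first_optional_arg + 2 &&
            !strcasecmp(c->argv[first_optional_arg]->ptr, "timeout")) {
            long long ms;
            if (getLongLongFromObject(c->argv[first_optional_arg + 1], &ms) != C_OK || ms < 0) {
                addReplyError(c, "timeout is not an integer or out of range");
                return C_ERR;
            }
            *timeout_ms = ms;
            return C_OK;
        }
        addReplyErrorObject(c, shared.syntaxerr);
        return C_ERR;
    }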

PingXie and others added 10 commits May 2, 2024 15:37
Signed-off-by: Ping Xie <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Initial PR to add a governance doc outlining permissions for the main
Valkey project as well as define responsibilities for sub-projects.

---------

Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Co-authored-by: zhaozhao.zz <[email protected]>
Co-authored-by: hwware <[email protected]>
Co-authored-by: binyan <[email protected]>
Initialize the mem_freed variable at declaration. This PR prevents the
variable from being used with an indeterminate value.

Signed-off-by: NAM UK KIM <[email protected]>
Delete unused declaration `void *dictEntryMetadata(dictEntry *de);` in
dict.h.

---------

Signed-off-by: Lipeng Zhu <[email protected]>
serverAssert is a drop-in replacement for assert. We use it even in code
copied from other sources. To make these files usable outside of Valkey,
it should be enough to replace the `serverassert.h` include with
`<assert.h>`. Therefore, this file shouldn't have any dependencies on
the rest of the valkey code.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
Improve the performance of crc64 for large batches by processing large
number of bytes in parallel and combining the results.

## Performance 
* 53-73% faster on Xeon 2670 v0 @ 2.6ghz
* 2-2.5x faster on Core i3 8130U @ 2.2 ghz
* 1.6-2.46 bytes/cycle on i3 8130U
* likely >2x faster than crcspeed on newer CPUs with more resources than
a 2012-era Xeon 2670
* crc64 combine function typically runs in <50 nanoseconds with vector +
cache optimizations (~8 *microseconds* without vector optimizations, ~80
*microseconds* without cache; the combination is extra effective)
* still single-threaded
* valkey-server test crc64 --help (requires `make distclean && make
SERVER_TEST=yes`)

---------

Signed-off-by: Josiah Carlson <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
…of-rewrite (valkey-io#393)

Renamed redis to valkey/server in aof.c serverlogs.

The AOF rewrite child process title is set to "redis-aof-rewrite" if
Valkey was started from a redis-server symlink, otherwise to
"valkey-aof-rewrite".

This is a breaking change since the logs are changed.

Part of valkey-io#207.

---------

Signed-off-by: Shivshankar-Reddy <[email protected]>
This is a minor change where naming and links now point properly
to valkey.

Fixes valkey-io#388

---------

Signed-off-by: Rolandas Šimkus <[email protected]>
Signed-off-by: simkusr <[email protected]>
Signed-off-by: simkusr <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: simkusr <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
These JSON files were originally not intended to be used directly, since
they contain internals and some fields like "acl_categories" that are not
the final ACL categories. (Valkey will apply some implicit rules to
compute the final ACL categories.) However, people see the JSON files
and use them directly anyway.

So it's better to document them.

In a later PR, we can get rid of all implicit ACL categories and instead
populate them explicitly in the JSON files. Then, we'll add a validation
(e.g. in generate-command-code.py) that the implied categories are set.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Binbin <[email protected]>
madolson (Member) left a comment:

Just tests left for review.

Comment on lines 2530 to 2535
/* We intentionally avoid updating myself's configEpoch when
* taking ownership of this slot. This approach is effective
* in scenarios where my primary crashed during the slot
* finalization process. I became the new primary without
* inheriting the slot ownership, while the source shard
* continued and relinquished the slot.
Member:

I feel very uncomfortable about just breaking core assumptions about the algorithm here (specifically omitting the configEpoch). It really feels like we should be bumping the epoch here, even if it means we steal slots unnecessarily. (This is an edge case after all) We are explicitly prioritizing availability here.

PingXie (Member, Author):

I think it is equally incorrect, if not more so, to bump the config epoch without going through consensus. The impact on availability is the same whether we bump the epoch or not, because this node will assume ownership of this slot. In this case, the more likely scenario is what I explained in the comments that follow (copied below too):

                    * By not increasing myself's configEpoch, we ensure that
                    * if the slot is correctly migrated to another primary, I
                    * will not mistakenly claim ownership. Instead, any ownership
                    * conflicts will be resolved accurately based on configEpoch
                    * values. */
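
For context, a simplified sketch of the epoch-based conflict resolution this comment relies on, paraphrasing the existing topology-update logic in clusterUpdateSlotsConfigWith (this is not code added by the PR): the claimant with the higher configEpoch wins the slot, so a node that keeps its old epoch yields the slot if it was legitimately migrated elsewhere.

    /* Simplified sketch of configEpoch-based slot conflict resolution,
     * paraphrased from the existing topology-update logic; not the
     * code added by this PR. */
    clusterNode *owner = server.cluster->slots[slot];
    if (owner == NULL || owner->configEpoch < senderConfigEpoch) {
        /* The claim with the higher epoch wins; a node that kept its old
         * epoch therefore yields the slot to a legitimate new owner. */
        clusterDelSlot(slot);
        clusterAddSlot(sender, slot);
    }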

Member:

> I think it is equally incorrect, if not more so, to bump the config epoch without going through consensus.

This isn't true; there is nothing wrong with bumping the epoch without consensus. Consensus is the preferred way to do it to avoid unnecessary churn, but it's not a prerequisite.

> The impact on availability is the same whether we bump the epoch or not, because this node will assume ownership of this slot. In this case, the more likely scenario is what I explained in the comments that follow (copied below too):

I agree; that is why I'm saying we should bump the epoch. There was a change to the slot ownership, and that is supposed to come with an epoch bump.

PingXie (Member, Author):

> This isn't true; there is nothing wrong with bumping the epoch without consensus. Consensus is the preferred way to do it to avoid unnecessary churn, but it's not a prerequisite.

The primary case that I am trying to handle is where the node in question becomes a primary without knowing that the source shard has just given up slot ownership to this node's old primary, which just failed. For this node to become the new primary, it would have to go through an election and get its config epoch bumped already (with consensus). However, it is also conceivable that the source shard might have given the slot to a different shard, and it is in this case where I still think it is safer not to blindly bump the config epoch, because doing so would undo the true intent; in the primary failure case we don't need to bump the config epoch, because it is a side product of the failover.

Btw, in all existing cases where we bump the config epoch without consensus, we have a clear intent to take over the slot ownership, and this is the key difference IMO.

madolson (Member) commented May 6, 2024:

> Btw, in all existing cases where we bump the config epoch without consensus, we have a clear intent to take over the slot ownership, and this is the key difference IMO.

You're ascribing intent to an algorithm. The epoch gets bumped every time the slot ownership changes, because we are in a new epoch with new state.

madolson (Member) commented May 6, 2024:

> I still think it is safer not to blindly bump the config epoch, because doing so would undo the true intent;

But you're okay with blindly serving data without true intent? I suppose that is my problem with this: either we commit to the change with the epoch bump, or we shouldn't be serving traffic.

src/cluster_legacy.c (outdated, resolved)
tests/unit/cluster/slot-migration.tcl (resolved)
tests/unit/cluster/slot-migration.tcl (outdated, resolved)
}

# restart a server and wait for it to come back online
proc restart_server_and_wait {server_id} {
Member:

Almost universally we de-schedule processes by using the pause helpers to trigger failovers. This does two things: one, we don't have to restart the process manually later; and two, it's similar to when a process disappears for a bit, so it tests additional modes.

Is there a specific reason we are restarting here?

PingXie (Member, Author):

> Almost universally we de-schedule processes by using the pause helpers to trigger failovers.

Is this pause_process? It kills the process and a restart is needed too. Or is this something else?

Member:

It just pauses the process, I was asking if we specifically need to kill it here?

PingXie (Member, Author):

I see. My intent is to simulate true failures. If I don't kill the server, I will have to use a "graceful failover" to switch roles.

Contributor:

If the server is paused (SIGPAUSE), it doesn't respond to cluster messages, and this leads to an automatic failover, in the same way as if the network were broken. I think this simulates a true failure.

Member:

Yeah, the segfault seems to be causing more valgrind issues as well. I think we should re-evaluate this decision.

assert {$duration > 2000}

# Setslot should fail with not enough good replicas to write after the timeout
assert_equal {NOREPLICAS Not enough good replicas to write.} $e
Member:

Yeah, it's more weird behavior, but for now it seems acceptable.

PingXie (Member, Author) commented May 3, 2024

> Yeah, it's more weird behavior, but for now it seems acceptable.

I am not sure I agree with the "more weird behavior" statement. There is actually a coherent story to tell/document here. The CLUSTER SETSLOT command is now pre-replicated to all replicas synchronously. If one or more replicas fail to acknowledge the command, returning "Not enough good replicas to write" seems very reasonable to me. I also don't want to introduce a specific error just for CLUSTER SETSLOT. Do you have a different error message in mind?
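
To illustrate the flow described above, a rough sketch of the synchronous pre-replication idea. The helper names setSlotReplicateAndWait and blockClientUntilReplicaAcks are hypothetical stand-ins (the latter for the WAIT-style blocking machinery); this is not the actual patch:

    /* Hypothetical sketch, not the actual patch: pre-replicate the
     * CLUSTER SETSLOT command and block the client until every replica
     * acknowledges it, or until the timeout expires (at which point the
     * command fails with the NOREPLICAS error asserted in the test). */
    static void setSlotReplicateAndWait(client *c, mstime_t timeout_ms) {
        long numreplicas = listLength(server.slaves);
        if (numreplicas == 0) return; /* nothing to wait for */

        forceCommandPropagation(c, PROPAGATE_REPL); /* replicate SETSLOT */
        replicationRequestAckFromSlaves();          /* ask replicas to ack */

        /* Stand-in for the WAIT-style blocking machinery: suspend the
         * client until all replicas ack the current replication offset
         * or the deadline passes. */
        blockClientUntilReplicaAcks(c, mstime() + timeout_ms,
                                    server.master_repl_offset, numreplicas);
    }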

PingXie (Member, Author) commented May 3, 2024

> Yeah, it's more weird behavior, but for now it seems acceptable.

> I am not sure I agree with the "more weird behavior" statement. There is actually a coherent story to tell/document here. The CLUSTER SETSLOT command is now pre-replicated to all replicas synchronously. If one or more replicas fail to acknowledge the command, returning "Not enough good replicas to write" seems very reasonable to me. I also don't want to introduce a specific error just for CLUSTER SETSLOT. Do you have a different error message in mind?

Okay, now I think this is "weird behavior" on the GitHub side :). It apparently misplaced your comment and made it look like it was about the error message, but after I clicked into the code, I saw that you were commenting about the "torn" states. Please ignore my response above.

[screenshot]

PingXie (Member, Author) commented May 7, 2024

Sorry I can't fix the DCO. Shouldn't have rebased. Closing this PR and moving changes to #445.

PingXie closed this May 7, 2024
PingXie deleted the slot_migration branch May 16, 2024 20:05
Labels
needs-doc-pr: This change needs to update a documentation page. Remove the label once the doc PR is open.
release-notes: This issue should get a line item in the release notes.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Improve slot migration reliability