
Slot migration improvement #245

Closed · wants to merge 33 commits

Conversation

PingXie (Member) commented Apr 6, 2024

Overview

This PR significantly enhances the reliability and automation of the Valkey cluster re-sharding process, specifically during slot migrations in the face of primary failures. These updates address critical failure issues that previously required extensive manual intervention and could lead to data loss or inconsistent cluster states.

Enhancements

Automatic Failover Support in Empty Shards

The cluster now supports automatic failover in shards that do not own any slots, which is common during scaling operations. This improvement ensures high availability and resilience from the outset of shard expansion.

Replication of Slot Migration States

All CLUSTER SETSLOT commands are now replicated to the replica nodes before being executed on the primary. This keeps the slot migration state consistent within the shard and prevents state loss if the primary fails. A new TIMEOUT parameter specifies how many milliseconds to wait for this replication to complete; the default is 2 seconds.

CLUSTER SETSLOT slot { IMPORTING node-id | MIGRATING node-id | NODE node-id | STABLE } [ TIMEOUT timeout ]
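
For example, an operator finalizing slot ownership could wait up to 5 seconds for the replicas to acknowledge the state change (the slot number and node-id below are placeholders):

    CLUSTER SETSLOT 609 NODE <target-node-id> TIMEOUT 5000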

Recovery of Logical Migration Links

The update automatically repairs the logical links between source and target nodes during failovers. This ensures that requests are correctly redirected to the new primary in the target shard after a primary failure, maintaining cluster integrity.

Enhanced Support for New Replicas

New replicas added to shards involved in slot migrations will now automatically inherit the slot's migration state as part of their initialization. This ensures that new replicas are immediately consistent with the rest of the shard.

Improved Logging for Slot Migrations

Additional logging has been implemented to provide operators with clearer insights into the slot migration processes and automatic recovery actions, aiding in monitoring and troubleshooting.

Additional Changes

cluster-allow-replica-migration

When cluster-allow-replica-migration is disabled, a primary that loses its last slot to another shard no longer automatically becomes a replica of the receiving shard. Instead, it remains in its own shard, which is now empty, with no slots assigned to it.
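
For reference, the relevant valkey.conf directive, shown here with the non-default value this behavior change applies to:

    cluster-allow-replica-migration no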

Fix #21

codecov bot commented Apr 20, 2024

Codecov Report

Attention: Patch coverage is 82.82443% with 45 lines in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (93f8a19) to head (6f459da).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #245      +/-   ##
============================================
+ Coverage     68.43%   68.89%   +0.45%     
============================================
  Files           109      109              
  Lines         61681    61785     +104     
============================================
+ Hits          42214    42566     +352     
+ Misses        19467    19219     -248     
Files Coverage Δ
src/commands.def 100.00% <ø> (ø)
src/debug.c 53.47% <100.00%> (+0.65%) ⬆️
src/networking.c 85.08% <100.00%> (ø)
src/rdb.c 75.79% <100.00%> (-0.47%) ⬇️
src/replication.c 86.33% <100.00%> (+0.03%) ⬆️
src/server.c 88.60% <100.00%> (+0.47%) ⬆️
src/blocked.c 91.80% <91.66%> (-0.06%) ⬇️
src/cluster_legacy.c 83.22% <81.11%> (+8.53%) ⬆️

... and 10 files with indirect coverage changes

PingXie marked this pull request as ready for review April 22, 2024 17:24
PingXie (Member, Author) commented Apr 22, 2024

@valkey-io/core-team ready for your review

zuiderkwast (Contributor) commented:

@PingXie I updated the PR description. Please check if it's correct and edit again if it's not.

zuiderkwast (Contributor) left a comment:

Partial review. (I've already reviewed it earlier and approved it, but that was long ago.)

src/valkey-cli.c (outdated, resolved)
src/valkey-cli.c (outdated, resolved)
valkey.conf (resolved)
PingXie (Member, Author) commented Apr 23, 2024

I think my previous fix (redis/redis#13055) introduced a race condition. More specifically, I don't think it is a good idea to move the slots on L3015; instead, I should have always routed the topology update through clusterUpdateSlotsConfigWith. As a result, the logic that attempts to fix the import source during a source-primary failover is bypassed if the old primary's (now a replica) PONG message races ahead of the new primary's (the old replica).

[screenshot]

> @PingXie I updated the PR description. Please check if it's correct and edit again if it's not.

Thanks @zuiderkwast! I will write the proper PR description next.

src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
zuiderkwast (Contributor) commented:

Two commits need sign-off. They're called "Update src/valkey-cli.c".

PingXie (Member, Author) commented Apr 28, 2024

> I think my previous fix (redis/redis#13055) introduced a race condition. More specifically, I don't think it is a good idea to move the slots on L3015; instead, I should have always routed the topology update through clusterUpdateSlotsConfigWith. As a result, the logic that attempts to fix the import source during a source-primary failover is bypassed if the old primary's (now a replica) PONG message races ahead of the new primary's (the old replica).

> [screenshot]

This should now be fixed. I don't think we need to backport the change to Valkey 7.2. The original fix should still work. It is just that the new reliability improvements in this PR require that all topology updates go through clusterUpdateSlotsConfigWith.

The next step is to (re)introduce the server-initiated wait and get rid of the REPLICAONLY flag.
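
For illustration only, a minimal sketch of the idea that gossip-driven slot claims should be funneled through clusterUpdateSlotsConfigWith rather than being applied inline, so the import-source fixup logic cannot be bypassed. The helper name and the exact checks are hypothetical (assuming the cluster_legacy.c context), not the actual patch:

    /* Hypothetical helper, for illustration only: route a sender's slot claim
     * through clusterUpdateSlotsConfigWith() instead of reassigning slots
     * inline, so that ownership changes, epoch conflicts, and migration-state
     * fixups are all handled in one place. */
    static void handleSenderSlotClaim(clusterNode *sender, clusterMsg *hdr) {
        uint64_t senderEpoch = ntohu64(hdr->configEpoch);
        int slotsChanged =
            memcmp(sender->slots, hdr->myslots, sizeof(hdr->myslots)) != 0;

        if (slotsChanged || sender->configEpoch < senderEpoch) {
            /* Centralized topology-update path. */
            clusterUpdateSlotsConfigWith(sender, senderEpoch, hdr->myslots);
        }
    }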

hpatro (Contributor) left a comment:

Mostly nit-picks. Pretty difficult to wrap my head around these complex conditions around slot ownership. Will re-read the code a few times :D.

  • We should apply the formatting (planned to be added) to the new code changes. I see parenthesis placement and spacing after commas differing from what's expected in a few places.
  • We are using primary/replica in certain place(s) and master/slave terminology in a few place(s) in code comments. Maybe just use primary/replica.

src/blocked.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
src/cluster_legacy.c (outdated, resolved)
.gitignore (resolved)
src/cluster_legacy.c (resolved)
src/replication.c (resolved)
src/cluster_legacy.c (outdated, resolved)
PingXie (Member, Author) commented Apr 30, 2024

Thanks for the review, @hpatro. There is definitely inconsistency in this PR as it has gone through too many rounds of refactoring. I will clean up the naming and comments as part of the timeout enhancement, but I will leave the coding style and terminology (primary/replica) to separate PRs, which I plan to address across the entire codebase.

src/cluster_legacy.c (outdated, resolved)
PingXie (Member, Author) commented Apr 30, 2024

Quick poll on the new timeout parameter.

We talked about allowing the client to control the wait time (with a default of 5s) on a per-command basis for CLUSTER SETSLOT, which now pre-replicates the command to the replicas first. This new parameter can only be added to the end of the command to maintain backward compatibility. I think a cleaner change would be introducing a "timeout" token, such as "timeout 1000", but this does not look like a pattern used in existing commands like WAIT, where the timeout parameter is expected at a fixed position. Thoughts?

hpatro (Contributor) commented Apr 30, 2024

For newer commands, e.g. https://valkey.io/commands/xread/, we went with what you're suggesting above, i.e. the keyword first, followed by the actual value. The parsing logic becomes slightly more complex but gives more flexibility. I think it's reasonable to proceed with your suggestion.
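
For illustration, a minimal sketch of parsing an optional trailing TIMEOUT token for CLUSTER SETSLOT, assuming the keyword-then-value form discussed above and the 2-second default from the PR description. The helper name is hypothetical; this is not the actual implementation:

    /* Illustrative sketch only: parse an optional trailing
     * "TIMEOUT <milliseconds>" pair, keeping the fixed-position
     * arguments backward compatible. Assumes the usual server.h helpers. */
    static int parseSetSlotTimeout(client *c, int first_optional_arg, mstime_t *timeout_ms) {
        *timeout_ms = 2000; /* default: 2 seconds */
        if (c->argc == first_optional_arg) return C_OK; /* no optional args */
        if (c->argc == first_optional_arg + 2 &&
            !strcasecmp(c->argv[first_optional_arg]->ptr, "timeout")) {
            long long ms;
            if (getLongLongFromObject(c->argv[first_optional_arg + 1], &ms) != C_OK || ms < 0) {
                addReplyError(c, "timeout is not an integer or out of range");
                return C_ERR;
            }
            *timeout_ms = ms;
            return C_OK;
        }
        addReplyErrorObject(c, shared.syntaxerr);
        return C_ERR;
    }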

PingXie and others added 10 commits May 2, 2024 15:37
Signed-off-by: Ping Xie <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Initial PR to add a governance doc outlining permissions for the main
Valkey project as well as define responsibilities for sub-projects.

---------

Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Co-authored-by: zhaozhao.zz <[email protected]>
Co-authored-by: hwware <[email protected]>
Co-authored-by: binyan <[email protected]>
Initialize the mem_freed variable at declaration. This PR prevents the
variable from being used with an indeterminate value.

Signed-off-by: NAM UK KIM <[email protected]>
Delete unused declaration `void *dictEntryMetadata(dictEntry *de);` in
dict.h.

---------

Signed-off-by: Lipeng Zhu <[email protected]>
serverAssert is a drop-in replacement for assert. We use it even in code
copied from other sources. To make these files usable outside of Valkey,
it should be enough to replace the `serverassert.h` include with
`<assert.h>`. Therefore, this file shouldn't have any dependencies on
the rest of the valkey code.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
Improve the performance of crc64 for large batches by processing large
number of bytes in parallel and combining the results.

## Performance 
* 53-73% faster on Xeon 2670 v0 @ 2.6ghz
* 2-2.5x faster on Core i3 8130U @ 2.2 ghz
* 1.6-2.46 bytes/cycle on i3 8130U
* likely >2x faster than crcspeed on newer CPUs with more resources than
a 2012-era Xeon 2670
* crc64 combine function typically runs in <50 nanoseconds with vector +
cache optimizations (~8 *microseconds* without vector optimizations, ~80
*microseconds* without cache; the combination is extra effective)
* still single-threaded
* valkey-server test crc64 --help (requires `make distclean && make
SERVER_TEST=yes`)

---------

Signed-off-by: Josiah Carlson <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
…of-rewrite (valkey-io#393)

Renamed redis to valkey/server in aof.c serverlogs.

The AOF rewrite child process title is set to "redis-aof-rewrite" if
Valkey was started from a redis-server symlink, otherwise to
"valkey-aof-rewrite".

This is a breaking change since the logs are changed.

Part of valkey-io#207.

---------

Signed-off-by: Shivshankar-Reddy <[email protected]>
This is a minor change where naming and links now point properly
to valkey.

Fixes valkey-io#388

---------

Signed-off-by: Rolandas Šimkus <[email protected]>
Signed-off-by: simkusr <[email protected]>
Signed-off-by: simkusr <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: simkusr <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
These JSON files were originally not intended to be used directly, since
they contain internals and some fields like "acl_categories" that are not
the final ACL categories. (Valkey will apply some implicit rules to
compute the final ACL categories.) However, people see the JSON files
and use them directly anyway.

So it's better to document them.

In a later PR, we can get rid of all implicit ACL categories and instead
populate them explicitly in the JSON files. Then, we'll add a validation
(e.g. in generate-command-code.py) that the implied categories are set.

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Binbin <[email protected]>
madolson (Member) left a comment:

Just tests left for review.

Comment on lines 2530 to 2535
/* We intentionally avoid updating myself's configEpoch when
* taking ownership of this slot. This approach is effective
* in scenarios where my primary crashed during the slot
* finalization process. I became the new primary without
* inheriting the slot ownership, while the source shard
* continued and relinquished the slot.
Member:

I feel very uncomfortable about just breaking core assumptions about the algorithm here (specifically omitting the configEpoch). It really feels like we should be bumping the epoch here, even if it means we steal slots unnecessarily. (This is an edge case after all) We are explicitly prioritizing availability here.

PingXie (Member, Author):

I think it is equally incorrect, if not more so, to bump the config epoch without going through consensus. The impact on availability is the same whether we bump the epoch or not, because this node will assume ownership of this slot. In this case, the more likely scenario is what I explained in the comments that follow (copied below too):

                    * By not increasing myself's configEpoch, we ensure that
                    * if the slot is correctly migrated to another primary, I
                    * will not mistakenly claim ownership. Instead, any ownership
                    * conflicts will be resolved accurately based on configEpoch
                    * values. */
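
For context, a simplified sketch of the epoch-based conflict resolution this comment relies on, paraphrasing the existing topology-update logic in clusterUpdateSlotsConfigWith (this is not code added by the PR): the claimant with the higher configEpoch wins the slot, so a node that keeps its old epoch yields the slot if it was legitimately migrated elsewhere.

    /* Simplified sketch of configEpoch-based slot conflict resolution,
     * paraphrased from the existing topology-update logic; not the
     * code added by this PR. */
    clusterNode *owner = server.cluster->slots[slot];
    if (owner == NULL || owner->configEpoch < senderConfigEpoch) {
        /* The claim with the higher epoch wins; a node that kept its old
         * epoch therefore yields the slot to a legitimate new owner. */
        clusterDelSlot(slot);
        clusterAddSlot(sender, slot);
    }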

Member:

> I think it is equally incorrect, if not more so, to bump the config epoch without going through consensus.

This isn't true; there is nothing wrong with bumping the epoch without consensus. Consensus is the preferred way to do it to avoid unnecessary churn, but it's not a prerequisite.

> The impact on availability is the same whether we bump the epoch or not, because this node will assume ownership of this slot. In this case, the more likely scenario is what I explained in the comments that follow (copied below too):

I agree; that is why I'm saying we should bump the epoch. There was a change to the slot ownership, and that is supposed to come with an epoch bump.

PingXie (Member, Author):

> This isn't true; there is nothing wrong with bumping the epoch without consensus. Consensus is the preferred way to do it to avoid unnecessary churn, but it's not a prerequisite.

The primary case that I am trying to handle is where the node in question becomes a primary without knowing that the source shard has just given up slot ownership to this node's old primary, which just failed. For this node to become the new primary, it would have to go through an election and get its config epoch bumped already (with consensus). However, it is also conceivable that the source shard might have given the slot to a different shard, and it is in this case where I still think it is safer not to blindly bump the config epoch, because doing so would undo the true intent; in the primary failure case we don't need to bump the config epoch, because it is a side product of the failover.

Btw, in all existing cases where we bump the config epoch without consensus, we have a clear intent to take over the slot ownership, and this is the key difference IMO.

madolson (Member) commented May 6, 2024:

> Btw, in all existing cases where we bump the config epoch without consensus, we have a clear intent to take over the slot ownership, and this is the key difference IMO.

You're ascribing intent to an algorithm. The epoch gets bumped every time the slot ownership changes, because we are in a new epoch with new state.

madolson (Member) commented May 6, 2024:

> I still think it is safer not to blindly bump the config epoch, because doing so would undo the true intent;

But you're okay with blindly serving data without true intent? I suppose that is my problem with this: either we commit to the change with the epoch bump, or we shouldn't be serving traffic.

src/cluster_legacy.c (outdated, resolved)
tests/unit/cluster/slot-migration.tcl (resolved)
tests/unit/cluster/slot-migration.tcl (outdated, resolved)
}

# restart a server and wait for it to come back online
proc restart_server_and_wait {server_id} {
Member:

Almost universally we de-schedule processes by using the pause helpers to trigger failovers. This does two things: one, we don't have to restart the process manually later; and two, it's similar to when a process disappears for a bit, so it tests additional modes.

Is there a specific reason we are restarting here?

PingXie (Member, Author):

> Almost universally we de-schedule processes by using the pause helpers to trigger failovers.

Is this pause_process? It kills the process and a restart is needed too. Or is this something else?

Member:

It just pauses the process, I was asking if we specifically need to kill it here?

PingXie (Member, Author):

I see. My intent is to simulate true failures. If I don't kill the server, I will have to use a "graceful failover" to switch roles.

Contributor:

If the server is paused (SIGPAUSE), it doesn't respond to cluster messages, and this leads to an automatic failover, in the same way as if the network were broken. I think this simulates a true failure.

Member:

Yeah, the segfault seems to be causing more valgrind issues as well. I think we should re-evaluate this decision.

assert {$duration > 2000}

# Setslot should fail with not enough good replicas to write after the timeout
assert_equal {NOREPLICAS Not enough good replicas to write.} $e
Member:

Yeah, it's more weird behavior, but for now it seems acceptable.

PingXie (Member, Author) commented May 3, 2024

> Yeah, it's more weird behavior, but for now it seems acceptable.

I am not sure I agree with the "more weird behavior" statement. There is actually a coherent story to tell/document here. The CLUSTER SETSLOT command is now pre-replicated to all replicas synchronously. If one or more replicas fail to acknowledge the command, returning "Not enough good replicas to write" seems very reasonable to me. I also don't want to introduce a specific error just for CLUSTER SETSLOT. Do you have a different error message in mind?
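
To illustrate the flow described above, a rough sketch of the synchronous pre-replication idea. The helper names setSlotReplicateAndWait and blockClientUntilReplicaAcks are hypothetical stand-ins (the latter for the WAIT-style blocking machinery); this is not the actual patch:

    /* Hypothetical sketch, not the actual patch: pre-replicate the
     * CLUSTER SETSLOT command and block the client until every replica
     * acknowledges it, or until the timeout expires (at which point the
     * command fails with the NOREPLICAS error asserted in the test). */
    static void setSlotReplicateAndWait(client *c, mstime_t timeout_ms) {
        long numreplicas = listLength(server.slaves);
        if (numreplicas == 0) return; /* nothing to wait for */

        forceCommandPropagation(c, PROPAGATE_REPL); /* replicate SETSLOT */
        replicationRequestAckFromSlaves();          /* ask replicas to ack */

        /* Stand-in for the WAIT-style blocking machinery: suspend the
         * client until all replicas ack the current replication offset
         * or the deadline passes. */
        blockClientUntilReplicaAcks(c, mstime() + timeout_ms,
                                    server.master_repl_offset, numreplicas);
    }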

PingXie (Member, Author) commented May 3, 2024

> Yeah, it's more weird behavior, but for now it seems acceptable.

> I am not sure I agree with the "more weird behavior" statement. There is actually a coherent story to tell/document here. The CLUSTER SETSLOT command is now pre-replicated to all replicas synchronously. If one or more replicas fail to acknowledge the command, returning "Not enough good replicas to write" seems very reasonable to me. I also don't want to introduce a specific error just for CLUSTER SETSLOT. Do you have a different error message in mind?

Okay, now I think this is "weird behavior" on the GitHub side :). It apparently misplaced your comment and made it look like it was about the error message, but after I clicked into the code, I saw that you were commenting about the "torn" states. Please ignore my response above.

[screenshot]

PingXie (Member, Author) commented May 7, 2024

Sorry I can't fix the DCO. Shouldn't have rebased. Closing this PR and moving changes to #445.

PingXie closed this May 7, 2024
PingXie deleted the slot_migration branch May 16, 2024 20:05
Labels
needs-doc-pr: This change needs to update a documentation page. Remove the label once the doc PR is open.
release-notes: This issue should get a line item in the release notes.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Improve slot migration reliability