Maintain deterministic order of CLUSTER SHARDS response #411

VoletiRam · 2024-05-01T08:11:34Z

Maintain deterministic order of CLUSTER SHARDS response. Currently we don't maintain the shards/masters in sorted fashion and hence we get the order of CLUSTER SHARDS response non-deterministic on different nodes. Maintain the sorted Masters list of pointers, similar to replicas, and replace the current <shards, list[nodes]> dict which is not suitable for sorting. Add the TOPOLOGY argument to get the deterministic response which would remove the replication offset and node health status from cluster shards response. Sort the masters based on the node Id. Include the new CLUSTER SHARDS TOPOLOGY command in the cluster_config_consistent procedure to ensure thorough test coverage and conduct a sanity check on cluster consistency.

Example response of CLUSTER SHARDS TOPOLOGY in a 2 primaries 2 replicas cluster.

Response from Primary 1:

127.0.0.1:6320> cluster shards topology
1) 1) "slots"
   2) 1) (integer) 501
      2) (integer) 16383
   3) "nodes"
   4) 1)  1) "id"
          2) "879df438dd67bd56e717d8400b718c743c4355d1"
          3) "port"
          4) (integer) 6321
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "master"
      2)  1) "id"
          2) "4c04d2a090de3e7c9a375ad0ee2e25b52c31c310"
          3) "port"
          4) (integer) 6323
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "replica"
2) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 500
   3) "nodes"
   4) 1)  1) "id"
          2) "af1e6a659a85779864456efe6a323fa7c7eda187"
          3) "port"
          4) (integer) 6320
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "master"
      2)  1) "id"
          2) "5f4de2e73e92b67c0d4adb0f5737776f5d14e137"
          3) "port"
          4) (integer) 6322
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "replica"

Response from Primary 2:

127.0.0.1:6321> cluster shards topology
1) 1) "slots"
   2) 1) (integer) 501
      2) (integer) 16383
   3) "nodes"
   4) 1)  1) "id"
          2) "879df438dd67bd56e717d8400b718c743c4355d1"
          3) "port"
          4) (integer) 6321
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "master"
      2)  1) "id"
          2) "4c04d2a090de3e7c9a375ad0ee2e25b52c31c310"
          3) "port"
          4) (integer) 6323
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "replica"
2) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 500
   3) "nodes"
   4) 1)  1) "id"
          2) "af1e6a659a85779864456efe6a323fa7c7eda187"
          3) "port"
          4) (integer) 6320
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "master"
      2)  1) "id"
          2) "5f4de2e73e92b67c0d4adb0f5737776f5d14e137"
          3) "port"
          4) (integer) 6322
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "role"
         10) "replica"

Ref: #114

codecov · 2024-05-01T08:23:17Z

Codecov Report

Attention: Patch coverage is 93.97590% with 10 lines in your changes missing coverage. Please review.

Project coverage is 70.28%. Comparing base (f2bbd1f) to head (d2303cf).
Report is 92 commits behind head on unstable.

Files	Patch %	Lines
src/cluster_legacy.c	93.97%	10 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #411      +/-   ##
============================================
- Coverage     70.30%   70.28%   -0.03%     
============================================
  Files           111      111              
  Lines         60300    60285      -15     
============================================
- Hits          42393    42370      -23     
- Misses        17907    17915       +8

Files	Coverage Δ
src/cluster_legacy.c	`86.05% <93.97%> (+0.19%)`	⬆️

... and 11 files with indirect coverage changes

hpatro · 2024-05-01T17:49:53Z

Thanks @VoletiRam for the PR.

There were discussion around using CANONICAL / DETERMINISTIC terminology for the filter. I think both of those are confusing from user perspective who is not bothered about the order. Rather I feel TOPOLOGY is more easy to understand where we provide a subset of information regarding the cluster topology. We also initially wanted the command to be named as CLUSTER TOPOLOGY redis/redis#10168.

@valkey-io/core-team Please take a look.

hpatro · 2024-05-01T17:53:09Z

src/cluster_legacy.c

 list *clusterGetNodesInMyShard(clusterNode *node) {
-    sds s = sdsnewlen(node->shard_id, CLUSTER_NAMELEN);
-    dictEntry *de = dictFind(server.cluster->shards,s);
-    sdsfree(s);
-    return (de != NULL) ? dictGetVal(de) : NULL;
+    clusterNode *master = clusterNodeGetMaster(node);
+
+    list *l = listCreate();
+    listAddNodeTail(l, master);
+    for (int i = 0; i < master->numslaves; i++) {
+        listAddNodeTail(l, master->slaves[i]);
+    }
+    return l;
 }


This operation becomes O(N) and some additional memory allocation. However, I'm not bothered a lot about it since the code flow is not on the hot path.

hwware · 2024-05-01T19:35:09Z

Just want to confirm with you @VoletiRam If addling the new parameter "topology" for cluster shards goal is that from every client's view, the output of the 2 masters and 2 replicas nodes is always:

127.0.0.1:6321> cluster shards topology

1. "slots"
2. 1. (integer) 501
  2. (integer) 16383
3. "nodes"
4. 1. 1. "id"
    2. "879df438dd67bd56e717d8400b718c743c4355d1"
    3. "port"
    4. (integer) 6321
      .....
  2. 1. "id"
    2. "4c04d2a090de3e7c9a375ad0ee2e25b52c31c310"
    3. "port"
    4. (integer) 6323
      ......
1. "slots"
2. 1. (integer) 0
  2. (integer) 500
  3. "nodes"
  4. 1. 1. "id"
    2. "af1e6a659a85779864456efe6a323fa7c7eda187"
    3. "port"
    4. (integer) 6320
  5. 1. "id"
    2. "5f4de2e73e92b67c0d4adb0f5737776f5d14e137"
    3. "port"
    4. (integer) 6322

BWT, I am reviewing this PR codes， Thanks

madolson · 2024-05-01T20:03:56Z

src/cluster_legacy.c

@@ -63,6 +63,8 @@ void clusterUpdateState(void);
 int clusterNodeCoversSlot(clusterNode *n, int slot);
 list *clusterGetNodesInMyShard(clusterNode *node);
 int clusterNodeAddSlave(clusterNode *master, clusterNode *slave);
+int clusterNodeAddMaster(clusterNode *master);


I actually don't like this abstraction. I would like it to be conceptually possible for a shard to exist without a master. @PingXie Thoughts?

Could you explain a bit more on how such situation can arise?

I prefer updating this function name as int clusterNodeAddToMasters(clusterNode *node);

I would like it to be conceptually possible for a shard to exist without a master.

This abstraction seems fine from a quick look. Basically, the contract is to keep track of all the primaries in the cluster in a deterministic order. Even if a primary-less shard could exist, the contract would be upheld regardless I think.

This also reminds me that I will need to create a PR to remove all references in code to "m/s".

I will take a closer look at this PR after merging #245

Thank you for taking a look. Will wait until you review the PR.

hpatro · 2024-05-01T20:28:31Z

@hwware The ordering is based on primary node id lexicographically irrespective of the topology filtering. topology filtering helps with removing the state of replication offset/health which can vary across node(s).

VoletiRam · 2024-05-02T01:20:41Z

Thank you @hwware. As @hpatro pointed, the view will be same for both CLUSTER SHARDS and CLUSTER SHARDS TOPOLOGY from every client. The TOPOLOGY option will exclude replication offset and node health status from the shards response. This request aims to eliminate differing fields, facilitating straightforward comparison of the shards response across cluster nodes via simple string matching, thus avoiding the need for parsing.

zuiderkwast · 2024-05-02T08:02:06Z

@PingXie This PR has large changes in cluster_legacy.c. I think we should merge your PR #245 before this one. Please check if you think everything will be different after merging your PR.

barshaul · 2024-05-02T10:39:31Z

src/cluster_legacy.c

-void addShardReplyForClusterShards(client *c, list *nodes) {
-    serverAssert(listLength(nodes) > 0);
-    clusterNode *n = listNodeValue(listFirst(nodes));
+void addShardReplyForClusterShards(client *c, clusterNode* n, int topology) {


minor: clusterNode* n -> clusterNode* primary to clarify that this function expects to receive a primary node

barshaul · 2024-05-02T10:48:59Z

@hpatro @VoletiRam

Does it filter out failed and loading nodes?

It would be ideal if the client's topology map could be solely based on the results of this command, eliminating the need for subsequent checks on the nodes' status.

hwware · 2024-05-02T14:02:02Z

@hwware The ordering is based on primary node id lexicographically irrespective of the topology filtering. topology filtering helps with removing the state of replication offset/health which can vary across node(s).

Thanks for your words. Then according to this rule, sorted by primary node id lexicographically, all clients should get the same view from any node. (At least primary output is same.)

VoletiRam · 2024-05-02T14:28:45Z

@barshaul We are only filtering out fields that contribute to non-deterministic output but not the node's information based on their health status. I think the ask was to eliminate volatile fields that can vary across the clients, at least not clear from discussion in #114. We can help filter out node's information if everyone agrees.

hwware · 2024-05-02T14:19:10Z

src/cluster.c

+                (c->argc == 2 || c->argc == 3))
+    {
+        /* CLUSTER SHARDS [TOPOLOGY] */
+        int topology = 1;


Usually, I set this kind of bool variable default value as 0 (here the variable more close to a bool variable).
But it is not a big issue I think

hwware · 2024-05-02T14:21:30Z

src/commands/cluster-shards.json

@@ -4,15 +4,20 @@
        "complexity": "O(N) where N is the total number of cluster nodes",
        "group": "cluster",
        "since": "7.0.0",
-        "arity": 2,
+        "arity": -2,
        "container": "CLUSTER",
        "function": "clusterCommand",
        "command_flags": [


Because you add one more argument here, you need add one item "history" here, for example

"history": [
[
"2.8.0",
"Added the -2 reply."
]
],

hwware · 2024-05-02T14:48:03Z

src/cluster_legacy.c

-void clusterCommandShards(client *c) {
-    addReplyArrayLen(c, dictSize(server.cluster->shards));
+void clusterCommandShards(client *c, int topology) {
+    serverAssert(server.cluster->nummasters > 0);


I do not think server.cluster->nummasters > 0 checking is must. 0 should work, how do you think?

Thank you very much for reviewing the changes. Will address the suggestions once everyone finishes reviewing the PR.

hpatro · 2024-05-08T22:12:51Z

@PingXie @madolson @barshaul

Few questions which we need to reach consensus on:

With this filtering option introduced, does CLUSTER SHARDS command have any significance in itself?
@barshaul Regarding removing replica(s) from response which are not ready to serve any data, would it be fine to keep the health data? I presume health information doesn't change that often compared to replication-offset.
Should we filter out primaries not serving any slot?

VoletiRam · 2024-05-09T05:32:45Z

Thank you @hpatro for raising the questions that need consensus. I want to add couple of questions as well.

I am checking few scenarios with 2 primaries - 2 replicas in a 4 node cluster with slot coverage on primaries.

If a primary is failed, the replica will never failover as there are not enough votes to win the election (need atleast 2 votes and there are not enough primaries with slot coverage in the cluster that are responsive). The quorum would never reach to mark the primary as failed. Except CLUSTER NODES, nothing indicates about failed/disconnected primary. How do we want the topology response in this scenario?
Let's assume replica failed over a primary. If old primary is still part of cluster but not responsive/failed (which means the role is still primary and marked as fail with no slot coverage), currently in CLUSTER SHARDS response, we show two primaries (old is marked fail and new one is online) but slots coverage will show as empty, which is inaccurate. How do we want the topology response in this scenario?

With my implementation, the slot coverage empty issue in ##5 can be solved as we go over each master from masters list and print corresponding slots, but it will still show old master in the response with empty slots and fail health status unless we decide to filter out either master node if marked fail or master with no slot coverage. Please share your opinion.

PingXie · 2024-05-09T08:15:02Z

With this filtering option introduced, does CLUSTER SHARDS command have any significance in itself?

Not sure if I understand your question. Can you elaborate?

@barshaul Regarding removing replica(s) from response which are not ready to serve any data, would it be fine to keep the health data? I presume health information doesn't change that often compared to replication-offset.

Based on the use case as described in #114 , including health makes sense to me as it is an important factor to consider for the "routing" decision.

Should we filter out primaries not serving any slot?
I would say yes since the uber use case is to facilitate routing decisions in the client.

I am checking few scenarios with 2 primaries - 2 replicas in a 4 node cluster with slot coverage on primaries.

I don't think 2-shard deployments are legit given the current design/implementation. We need to officially support 2-shard clusters first and then it makes sense to discuss the output of cluster shards in this case.

@zuiderkwast should we resurrect the "voting replicas" discussion? redis/redis#12390

hpatro · 2024-05-09T17:07:38Z

With this filtering option introduced, does CLUSTER SHARDS command have any significance in itself?

Not sure if I understand your question. Can you elaborate?

If all clients would prefer using CLUSTER SHARDS TOPOLOGY command, what will be the utility of CLUSTER SHARDS command ? Few alternatives I can think of apart from the above solution:

Deprecate CLUSTER SHARDS command and introduce CLUSTER TOPOLOGY command with the trimmed down version.
Make a breaking change in Valkey 8 and update CLUSTER SHARDS command itself and remove replication-offset from it (I'm not aware of the client adoption and version handling).

hwware · 2024-05-09T19:22:12Z

With this filtering option introduced, does CLUSTER SHARDS command have any significance in itself?

Not sure if I understand your question. Can you elaborate?

If all clients would prefer using CLUSTER SHARDS TOPOLOGY command, what will be the utility of CLUSTER SHARDS command ? Few alternatives I can think of apart from the above solution:

Deprecate CLUSTER SHARDS command and introduce CLUSTER TOPOLOGY command with the trimmed down version.

Make a breaking change in Valkey 8 and update CLUSTER SHARDS command itself and remove replication-offset from it (I'm not aware of the client adoption and version handling).

I do not think we should deprecate CLUSTER SHARDS command. Client need to remember one more command and
the worse case is to update the client side code as well.

Thus my suggestion is:
If client just call CLUSTER SHARDS, the response should be the same behavior as now.
If client call CLUSTER SHARDS TOPOLOGY, the result should be from this PR.

hwware · 2024-05-09T19:31:36Z

Thank you @hpatro for raising the questions that need consensus. I want to add couple of questions as well.

I am checking few scenarios with 2 primaries - 2 replicas in a 4 node cluster with slot coverage on primaries.

If a primary is failed, the replica will never failover as there are not enough votes to win the election (need atleast 2 votes and there are not enough primaries with slot coverage in the cluster that are responsive). The quorum would never reach to mark the primary as failed. Except CLUSTER NODES, nothing indicates about failed/disconnected primary. How do we want the topology response in this scenario?

Let's assume replica failed over a primary. If old primary is still part of cluster but not responsive/failed (which means the role is still primary and marked as fail with no slot coverage), currently in CLUSTER SHARDS response, we show two primaries (old is marked fail and new one is online) but slots coverage will show as empty, which is inaccurate. How do we want the topology response in this scenario?

With my implementation, the slot coverage empty issue in ##5 can be solved as we go over each master from masters list and print corresponding slots, but it will still show old master in the response with empty slots and fail health status unless we decide to filter out either master node if marked fail or master with no slot coverage. Please share your opinion.

In fact. 2 primaries - 2 replicas cluster, 2 primaries - 0 replicas cluster, 2 primaries - 4 replicas cluster are totally different.

Case 1: 2 primaries - 0 replicas cluster. If client set cluster-require-full-coverage as no in conf file, cluster still work even one primary node fail.

Case 2: 2 primaries - 2 replicas cluster: If any primary fails, no vote happens, and replica can failover immediately

Case 3: 2 primaries - 4 replicas cluster: vote happen if any primary fails

So i agree with Ping, let us first support 2-shard clusters first then discuss the output of cluster shards [topology]

zuiderkwast · 2024-05-10T10:26:57Z

I don't think 2-shard deployments are legit given the current design/implementation. We need to officially support 2-shard clusters first and then it makes sense to discuss the output of cluster shards in this case.

@zuiderkwast should we resurrect the "voting replicas" discussion?

@PingXie Whether replicas can vote or whether the cluster has quorum to perform failovers, or even what kind of consensus algorithm is used, should be irrelevant to the clients. (It's even possible to have some external watchdog that performs manual failover.) So let's decouple those discussions from this PR?

Based on the use case as described in #114 , including health makes sense to me as it is an important factor to consider for the "routing" decision.

I don't think clients should make their own decisions about the health of nodes. That's something the cluster does for them. The clients should only be concerned with routing according to what the cluster tells them. For this, there's no need to include shards without slots. Maybe it's better to exclude them, because such nodes are usually going to be taken down or are just being set up and not really ready to be used for pubsub and other stuff clients may want to send to them.

To summarize: I think CLUSTER SHARDS TOPOLOGY should return no more info that what's included in CLUSTER SLOTS. (Just on a different format.)

I do not think we should deprecate CLUSTER SHARDS command. Client need to remember one more command and the worse case is to update the client side code as well.

Thus my suggestion is:
If client just call CLUSTER SHARDS, the response should be the same behavior as now.
If client call CLUSTER SHARDS TOPOLOGY, the result should be from this PR.

I agree with @hwware about this. If clients have started using CLUSTER SHARDS, we can let them do that. Let's not break it.

madolson · 2024-05-10T17:38:41Z

To summarize: I think CLUSTER SHARDS TOPOLOGY should return no more info that what's included in CLUSTER SLOTS. (Just on a different format.)

If we accept this premise, I think we should consider that maybe we are trying to force CLUSTER SHARDS to be something it isn't. CLUSTER SHARDS was supposed to solve three things:

Be more densely encoded for large highly fragmented clusters. I.e., where slots are distributed in such a way that they don't form continuous ranges. This causes cluster slots to print a lot of duplicate information.
Be a more readable and extensible version of cluster nodes for clients that want more information (like health and offset) but out of the text based format of CLUSTER NODES. It was also supposed to be for humans to see the status of a cluster.
Unwind some poor choices we made historically about "preferred" endpoints and make it easier for clients to see the "whole" picture about networking. I don't think SHARDS has really been that successful here.

It seems like we are saying that clients just shouldn't care about all the extra data provided by CLUSTER NODES. In that world, why wouldn't we do something like CLUSTER SLOTS COMPACT that attempts to help CLUSTER SLOTS solve problem 1. We could change the return type from a start and end, to just an array of start/end ranges. Then clients can more easily adopt the new code? In respect to 3, I guess clients already have that baked in, so do we need to fix it?

The asks from @barshaul are basically, "I don't want any more information, I just want to know what slots are healthy and able to be served from". That is what CLUSTER SLOTS does.

So, we can make CLUSTER SHARDS more deterministic (in ordering), it's still a nice property for doing diffs. But I'm not sure I'm convinced anymore that CLUSTER SHARDS TOPOLOGY is a good idea.

zuiderkwast · 2024-05-10T19:43:53Z

Yes (1) was what I meant, but I wasn't completely aware of the background and details.

It seems like the main point of this new CLUSTER SHARDS variant is that it's deterministic, so a you (or a test case) can check that the nodes' views of the cluster is consistent. This isn't the use case for client slot routing. It's rather a use case for test cases and for admins, to check that the cluster converges after adding/removing nodes, slot migrations, etc. If it's deterministic for a healthy cluster even with health info included, then I'm not going to argue against it.

It can be used by clients too, just to save some bytes, but if some clients feels they want more info, they'll just use the full version of the command, or CLUSTER NODES. That can't be helped.

So I guess the question should be: How common or important is it for cluster admins to check that a cluster converges in this way?

(In our own test framework we can solve it in some other way if it's just for us.)

madolson · 2024-05-12T23:48:11Z

So I guess the question should be: How common or important is it for cluster admins to check that a cluster converges in this way?

I've done a fair amount of "diff" between various cluster outputs, and usually have to do some pre-processing to make sure they agree. It would be nice if the node ordering was the same in that case. You could then trivially ignore the fields that are known to be slightly different (replication offset).

madolson · 2024-05-15T15:05:45Z

To make my suggestion about cluster slots more concrete, I'm proposing a change so that the response of cluster slots becomes:

> CLUSTER SLOTS
1) 1) 1) (integer) 0  -- Start of range 1
      2) (integer) 10000 -- Start of range 2
   2) 1) (integer) 5460 -- End of range 1
      2) (integer) 12000 -- End of range 2
   3) 1) "127.0.0.1"
      2) (integer) 30001
      3) "09dbe9720cda62f7865eabc5fd8857c5d2678366"
      4) 1) hostname
         2) "host-1.valkey.example.com"
   4) 1) "127.0.0.1"
      2) (integer) 30004
      3) "821d8ca00d7ccf931ed3ffc7e3db0599d2271abf"
      4) 1) hostname
         2) "host-2.valkey.example.com"
2) 1) 1) (integer) 5461 -- Start of range 1
      2) (integer) 12001 -- Start of range 2
   2) 1) (integer) 9999 -- End of range 1
      2) (integer) 16383 -- End of range 2
   3) 1) "127.0.0.1"
      2) (integer) 30002
      3) "c9d93d9f2c0c524ff34cc11838c2003d8c29e013"
      4) 1) hostname
         2) "host-3.valkey.example.com"
   4) 1) "127.0.0.1"
      2) (integer) 30005
      3) "faadb3eb99009de4ab72ad6b6ed87634c7ee410f"
      4) 1) hostname
         2) "host-4.valkey.example.com"

Besides that, it behaves the exact same cluster CLUSTER SLOTS. Most likely by adding an argument like CLUSTER SLOTS PACKED or something.

zuiderkwast · 2024-05-15T16:33:41Z

@madolson I don't think the reason clients haven't adopted CLUSTER SHARDS (added in 7.0) is that it's hard to parse. The reason is rather that clients want to be backward compatible and support old Redis versions. If we add CLUSTER SLOTS PACKED, it will have the same problem: Clients can only use it if they know the server supports it, and then they still need a fallback for versions that don't.

Once Redis 6.2 and all Redis 6-compatible services are EOL (or about the time of Valkey 9 is released), then all deployments support CLUSTER SHARDS and then we can start expecting clients to switch to CLUSTER SHARDS.

madolson · 2024-05-15T17:35:54Z

I don't think the reason clients haven't adopted CLUSTER SHARDS (added in 7.0) is that it's hard to parse. The reason is rather that clients want to be backward compatible and support old Redis versions. If we add CLUSTER SLOTS PACKED, it will have the same problem: Clients can only use it if they know the server supports it, and then they still need a fallback for versions that don't.

I agree! People don't want to use the CLUSTER SHARDS we implemented, for reasons that Bar mentioned, so if we were to introduce a net new variant CLUSTER SHARDS TOPOLOGY, we are in the same situation of asking clients to move again. I'm saying that maybe CLUSTER SHARDS as a client command was a mistake, the way it was structured was almost entirely to make it more likely to be adopted by a single client, lettuce, since they wanted weird extra information. What I got from this thread though, is we think clients shouldn't be collecting more information, and should just rely on the flat response from CLUSTER SLOTS.

I suppose there is another option. If we implement a client capability functionality, we could make a change that allows clients to "opt-in" to the new format I proposed. If at startup, they send CLIENT CAPA PACKED-SLOTS, then we can respond with packed response type. Clients can check during parse if it's the "integer" or "array" encoded responses, and handle it accordingly.

Once Redis 6.2 and all Redis 6-compatible services are EOL (or about the time of Valkey 9 is released), then all deployments support CLUSTER SHARDS and then we can start expecting clients to switch to CLUSTER SHARDS.

I don't agree this will happen. Lots of people will continue to use old versions because they will be supported.

zuiderkwast · 2024-05-15T20:29:12Z

Regarding this PR: Can we just settle with sorting what can be sorted in CLUSTER SHARDS? No new argument. That's my vote. Then we document what needs to be ignored when comparing the result from two different nodes. That means a doc PR.

madolson · 2024-05-15T21:16:07Z

@VoletiRam @hpatro Can you review what Victor posted in the previous message. Instead of adding a new new command, let's just make the existing version deterministically ordered but not make any changes to arguments.

madolson · 2024-05-19T18:14:25Z

src/cluster_legacy.c

 }

 /* Add to the output buffer of the given client, an array of slot (start, end)
 * pair owned by the shard, also the primary and set of replica(s) along with
 * information about each node. */
 void clusterCommandShards(client *c) {
-    addReplyArrayLen(c, dictSize(server.cluster->shards));
+    addReplyArrayLen(c, server.cluster->nummasters);


This change becomes much simpler if we replace the dict with a RAX, instead of maintaining this list. It also preserves the structure we built up, which is the Cluster->Shards->nodes mapping.

I will explain the rationale to decide to maintain the pointers to masters instead of shards->list-of-nodes.

We initially thought of replacing the Dict with RAX, but that only helps solve one problem. We can maintain sorted shardIDs with RAX, but we still need the corresponding list of to be sorted if we want to uphold the shards->nodes mapping; otherwise, the list of is useless. We are currently using shards->nodes only for the CLUSTER SHARDS command and not anywhere else. Having RAX, list, and clusterNode overhead to maintain this list just for one command doesn't seem right when it doesn't come with significant performance or memory benefits over an array of pointers to masters. We felt that maintaining the array of pointers to masters is a reasonable tradeoff. Technically, we still honored the shardId->master contract through pointers to masters.

We are currently using shards->nodes only for the CLUSTER SHARDS command and not anywhere else. Having RAX, list, and clusterNode overhead to maintain this list just for one command doesn't seem right when it doesn't come with significant performance or memory benefits over an array of pointers to masters. We felt that maintaining the array of pointers to masters is a reasonable tradeoff. Technically, we still honored the shardId->master contract through pointers to masters.

The thinking was that we would like to use the SHARD ID more in the future as an O(1) lookup, as it's the more logical identifier as opposed to just the Primary ID.

madolson · 2024-05-20T02:06:24Z

src/commands/cluster-shards.json

-        "command_tips": [
-            "NONDETERMINISTIC_OUTPUT"
-        ],
        "reply_schema": {


The replication offset is still effectively non-determinsitic, so I think we nee to leave this flag. It's not used within the engine to determine anything, so it should be OK to leave.

madolson · 2024-05-20T02:08:23Z

tests/cluster/tests/28-cluster-shards.tcl

@@ -285,3 +285,24 @@ test "CLUSTER MYSHARDID reports same shard id after cluster restart" {
        assert_equal [dict get $node_ids $i] [R $i cluster myshardid]
    }
 }
+
+test "Deterministic order of CLUSTER SHARDS response" {


This test isn't part of the CI since it uses the legacy clustering test framework. We should either move cluster-shards to the new framework or just move this part of the file over and we can remove the rest of the file in a separate PR.

hpatro · 2024-05-20T17:23:33Z

@VoletiRam One of the thing which came up while discussing with @madolson we could sort the cluster->nodes (maybe use rax?) and pull out primaries from that. With that we could avoid these additional pointers. Could you check if that's feasible?

bbarani · 2024-06-25T00:36:46Z

@VoletiRam One of the thing which came up while discussing with @madolson we could sort the cluster->nodes (maybe use rax?) and pull out primaries from that. With that we could avoid these additional pointers. Could you check if that's feasible?

@VoletiRam do you have any updates here?

Maintain deterministic order of CLUSTER SHARDS response. Currently we don't maintain the shards/masters in sorted fashion and hence we get the order of CLUSTER SHARDS response non-deterministic on different nodes. Maintain the sorted Masters list of pointers, similar to replicas, and get rid of <shards, list<nodes>> dict which is not suitable for sorting. Add TOPOLOGY argument to get the deterministic response which would remove the replication offset and node health status from cluster shards response. Sort the nodes based on the node Id. Use it in proc `cluster_config_consistent` for the test coverage and sanity purpose. Signed-off-by: Ram Prasad Voleti <[email protected]>

Remove topology argument and cleanup related code changes. Signed-off-by: Ram Prasad Voleti <[email protected]>

Replace Dict with Rax for Cluster Nodes and construct primaries list on the go, instead of maintaining shards/masters list. Signed-off-by: Ram Prasad Voleti <[email protected]>

VoletiRam · 2024-07-05T08:41:50Z

Sorry for the delayed response. Was busy with the other commitments at work. I addressed the comments. I replaced the Dict data structure with Rax for cluster->nodes, and constructed the list of primaries from it when the 'CLUSTER SHARDS' command is requested.

Please review whenever you get a chance @hpatro @madolson

hpatro · 2024-08-08T16:54:21Z

@madolson we would still want to improve CLUSTER SHARDS, right? I think we should. Would like to understand @valkey-io/core-team stance before diving deep into the PR.

enjoy-binbin · 2024-08-09T03:12:42Z

yes, i am in for the CLUSTER SHARDS improve thing.

hpatro

Overall LGTM. There are plenty of touchpoints (iteration over rax) but the idea is to replace dict with rax to maintain the cluster nodes information to get primaries/replicas in lexicographical ordering.

hpatro · 2024-08-09T20:56:24Z

src/cluster_legacy.c

+    list *primaries = clusterGetPrimaries();
+    addReplyArrayLen(c, listLength(primaries));
+    listIter li;
+    listRewind(primaries, &li);
+    for (listNode *ln = listNext(&li); ln != NULL; ln = listNext(&li)) {
+        clusterNode *n = listNodeValue(ln);
+        addShardReplyForClusterShards(c, n);
    }
-    dictReleaseIterator(di);
+    listRelease(primaries);


This is the crux of the change. Here we would get primaries in lexicographical ordering due to underlying RAX structure.

hpatro · 2024-08-09T20:57:41Z

src/rax.h

@@ -95,6 +95,7 @@
 */

 #define RAX_NODE_MAX_SIZE ((1 << 29) - 1)
+#define RAX_OK 1


This is not used from RAX related method's return statements ?

hpatro · 2024-08-09T21:50:29Z

@PingXie Could you also take a look at this? It removes one of the abstraction you had introduced of shards -> nodes.

madolson · 2024-08-09T21:55:23Z

@madolson we would still want to improve CLUSTER SHARDS, right?

Yeah. I'm still fine improving this. I wasn't sure I was happy with removing the shards abstraction though from the code internals since we intend to add it in the migrate slots command.

hpatro · 2024-08-09T22:28:10Z

I wasn't sure I was happy with removing the shards abstraction though from the code internals since we intend to add it in the migrate slots command.

We could maybe still keep the abstraction with the small overhead we were already paying.

PingXie · 2024-08-12T02:15:26Z

@PingXie Could you also take a look at this? It removes one of the abstraction you had introduced of shards -> nodes.

I like the high level idea of single-sourcing the shard membership management.

However, I have a few questions regarding the impact of switching from a dictionary to a Rax:

I believe Rax may result in higher cache misses during lookups. This should not be an issue on smaller clusters but could be an issue for large clusters with 100s of nodes where we can see higher cluster traffic. Has any performance evaluation been done on the changes?
Additionally, the effect on random node selection during gossiping is worth further consideration. The overhead will also increase, and I’m curious whether this change might also affect the distribution of random node selections.

hpatro · 2024-08-16T20:41:47Z

I believe Rax may result in higher cache misses during lookups. This should not be an issue on smaller clusters but could be an issue for large clusters with 100s of nodes where we can see higher cluster traffic. Has any performance evaluation been done on the changes?

Additionally, the effect on random node selection during gossiping is worth further consideration. The overhead will also increase, and I’m curious whether this change might also affect the distribution of random node selections.

These are good callouts but might be difficult to measure. @PingXie Do you have any suggestion/scenario(s) in mind to reproduce? Will be helpful for @VoletiRam.

PingXie · 2024-08-21T01:08:50Z

I believe Rax may result in higher cache misses during lookups. This should not be an issue on smaller clusters but could be an issue for large clusters with 100s of nodes where we can see higher cluster traffic. Has any performance evaluation been done on the changes?

Additionally, the effect on random node selection during gossiping is worth further consideration. The overhead will also increase, and I’m curious whether this change might also affect the distribution of random node selections.

These are good callouts but might be difficult to measure. @PingXie Do you have any suggestion/scenario(s) in mind to reproduce? Will be helpful for @VoletiRam.

can we decouple the two changes: single sourcing the shard membership management and switching to Rax? I think we need some time to better understand the impact of Rax but we could benefit from the single sourcing change sooner.

For the performance impact analysis of Rax, I'm thinking about writing a small program that constructs a Rax tree with 1000 cluster nodes and performs queries on it. To simulate real-world conditions, we'll periodically flush the CPU cache by writing a large amount of data to a 32 MB memory block. We'll repeat the same process for a hash table-based implementation. Afterward, we can compare the aggregated lookup times, excluding the memory copy time.

For the distribution analysis, we can take a similar approach by having the program log its random node selections. We can then generate charts to compare the distribution patterns between the Rax-based and dictionary-based implementations.

Thoughts?

madolson · 2024-08-26T15:07:52Z

For now, the conclusion is we are okay leaving CLUSTER SHARDS as not deterministic. We are going to wait for the cached cluster slots response to see if it solves the performance issue folks have been seeing. If we still see issues, we will investigate the CLUSTER SLOTS DENSE.

hpatro approved these changes May 1, 2024

View reviewed changes

madolson reviewed May 1, 2024

View reviewed changes

zuiderkwast requested a review from PingXie May 2, 2024 08:02

barshaul reviewed May 2, 2024

View reviewed changes

hwware reviewed May 2, 2024

View reviewed changes

madolson added the major-decision-pending Major decision pending by TSC team label May 15, 2024

madolson reviewed May 19, 2024

View reviewed changes

madolson mentioned this pull request May 19, 2024

[NEW] Compact variant CLUSTER SLOTS DENSE #517

Open

madolson reviewed May 20, 2024

View reviewed changes

Ram Prasad Voleti added 2 commits July 5, 2024 03:35

Remove topology argument and cleanup related code changes

622a1bf

Remove topology argument and cleanup related code changes. Signed-off-by: Ram Prasad Voleti <[email protected]>

VoletiRam force-pushed the deterministic_shards branch 5 times, most recently from 45b7926 to 20d8225 Compare July 5, 2024 08:20

Replace Dict with Rax for Cluster Nodes

d2303cf

Replace Dict with Rax for Cluster Nodes and construct primaries list on the go, instead of maintaining shards/masters list. Signed-off-by: Ram Prasad Voleti <[email protected]>

VoletiRam force-pushed the deterministic_shards branch from 20d8225 to d2303cf Compare July 5, 2024 08:25

barshaul mentioned this pull request Aug 5, 2024

[NEW] Deterministic CLUSTER SHARDS Command #114

Open

hpatro self-requested a review August 9, 2024 20:51

hpatro reviewed Aug 9, 2024

View reviewed changes

zuiderkwast linked an issue Aug 12, 2024 that may be closed by this pull request

[NEW] Deterministic CLUSTER SHARDS Command #114

Open

madolson closed this Aug 26, 2024

Maintain deterministic order of CLUSTER SHARDS response #411

Maintain deterministic order of CLUSTER SHARDS response #411

Conversation

VoletiRam commented May 1, 2024

codecov bot commented May 1, 2024 • edited Loading

Codecov Report

hpatro commented May 1, 2024

Choose a reason for hiding this comment

hwware commented May 1, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hpatro commented May 1, 2024

VoletiRam commented May 2, 2024

zuiderkwast commented May 2, 2024

barshaul May 2, 2024 • edited Loading

Choose a reason for hiding this comment

barshaul commented May 2, 2024

hwware commented May 2, 2024 • edited Loading

VoletiRam commented May 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hpatro commented May 8, 2024

VoletiRam commented May 9, 2024 • edited Loading

PingXie commented May 9, 2024

hpatro commented May 9, 2024

hwware commented May 9, 2024

hwware commented May 9, 2024

zuiderkwast commented May 10, 2024 • edited Loading

madolson commented May 10, 2024 • edited Loading

zuiderkwast commented May 10, 2024

madolson commented May 12, 2024 • edited Loading

madolson commented May 15, 2024 • edited Loading

zuiderkwast commented May 15, 2024

madolson commented May 15, 2024

zuiderkwast commented May 15, 2024

madolson commented May 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hpatro commented May 20, 2024

bbarani commented Jun 25, 2024

VoletiRam commented Jul 5, 2024

hpatro commented Aug 8, 2024

enjoy-binbin commented Aug 9, 2024

hpatro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hpatro commented Aug 9, 2024

madolson commented Aug 9, 2024 • edited Loading

hpatro commented Aug 9, 2024

PingXie commented Aug 12, 2024

hpatro commented Aug 16, 2024

PingXie commented Aug 21, 2024

madolson commented Aug 26, 2024

codecov bot commented May 1, 2024 •

edited

Loading

barshaul May 2, 2024 •

edited

Loading

hwware commented May 2, 2024 •

edited

Loading

VoletiRam commented May 2, 2024 •

edited

Loading

VoletiRam commented May 9, 2024 •

edited

Loading

zuiderkwast commented May 10, 2024 •

edited

Loading

madolson commented May 10, 2024 •

edited

Loading

madolson commented May 12, 2024 •

edited

Loading

madolson commented May 15, 2024 •

edited

Loading

madolson commented Aug 9, 2024 •

edited

Loading