[NEW] Deterministic CLUSTER SHARDS Command #114

Open · barshaul opened this issue Apr 1, 2024 · 17 comments
barshaul (Contributor) commented Apr 1, 2024

The problem/use-case that the feature addresses

Currently, some clients rely on a single-node perspective to obtain the cluster topology, which may not accurately reflect the entire cluster's state. Ideally, clients should query multiple nodes and compare their topology views to determine the most consistent representation. To make this comparison efficient on the client side, Valkey could offer a deterministic view in which nodes with identical views return identical outputs, so that clients could simply hash and compare those outputs. At the moment, certain fields in the CLUSTER SHARDS output, such as "replication-offset", may differ between nodes even when they share the same view, so the client must parse each view and filter out the volatile fields itself. Furthermore, in the current output of CLUSTER SHARDS (or CLUSTER SLOTS), nodes with identical views may list the replicas of a given slot range in differing orders, which forces clients to parse and sort the outputs before comparing views.

Description of the feature

To achieve deterministic output, two changes are suggested:

  1. Adding a filter option to the CLUSTER SHARDS command. This enhancement would enable clients to selectively filter out fields that are irrelevant for their specific use cases, such as non-deterministic fields, nodes in 'fail' state that are irrelevant to the client's functionality, or other fields the client doesn't need. It would enable hash-based comparison of views without the need to parse the output, while also potentially reducing the size of each node's output (see the sketch after this list).

  2. Sorting the replica order in the output.
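
For illustration, here is a minimal sketch (not part of the proposal itself) of the client-side work these two changes would eliminate. It assumes the CLUSTER SHARDS reply has already been parsed into a list of shard dicts with "slots" and "nodes" keys, and it assumes "replication-offset" and "health" are the volatile fields to drop; the exact field names may need adjusting.

```python
import hashlib
import json

# Fields whose values may differ between nodes that share the same
# topology view (assumed names; adjust to the documented reply format).
VOLATILE_FIELDS = {"replication-offset", "health"}

def canonical_view(shards):
    """Normalize a parsed CLUSTER SHARDS reply so that nodes with
    identical topology views produce identical structures."""
    normalized = []
    for shard in shards:
        nodes = [
            {k: v for k, v in node.items() if k not in VOLATILE_FIELDS}
            for node in shard["nodes"]
        ]
        nodes.sort(key=lambda n: n["id"])  # deterministic replica order
        normalized.append({"slots": shard["slots"], "nodes": nodes})
    normalized.sort(key=lambda s: s["slots"])  # deterministic shard order
    return normalized

def view_hash(shards):
    """Hash the canonical form; equal hashes imply identical views."""
    payload = json.dumps(canonical_view(shards), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

With server-side support for both suggestions, a client could skip all of this and hash the raw reply bytes directly.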

madolson (Member) commented Apr 2, 2024

> Adding a filter option to the CLUSTER SHARDS command. This enhancement would enable clients to selectively filter out fields that are irrelevant for their specific use cases, such as non-deterministic fields, nodes in 'fail' state that are irrelevant to the client's functionality, or other fields the client doesn't need. It would enable hash-based comparison of views without the need to parse the output, while also potentially reducing the size of each node's output.

I'm generally against adding the kind of complexity that filtering would require. So I would prefer we just add CLUSTER SHARDS DETERMINISTIC (I don't like the word, maybe canonical?), allow clients to optionally call that, and uphold that type of contract on the server.

> Sorting the replica order in the output.

I agree with this.

daniel-house (Member) commented:

I like the word deterministic. If the output is not sorted it can't be either deterministic or canonical, so I like it being sorted.

Clearly you feel that the fields are not presented in a way that is easy for the client to filter. Perhaps deterministic or canonical, or some other keyword, could change the format to something that would be trivial to filter - maybe even JSON.

madolson (Member) commented Apr 4, 2024

> Clearly you feel that the fields are not presented in a way that is easy for the client to filter. Perhaps deterministic or canonical, or some other keyword, could change the format to something that would be trivial to filter - maybe even JSON.

I'm not sure the format is the concern. The last time I discussed this with Bar (who works at AWS), the point was that it was simpler and faster to do it without parsing. Maybe there is more to it, though.

hpatro self-assigned this Apr 12, 2024
PingXie (Member) commented Apr 13, 2024

The way I see it, the problem with CLUSTER SHARDS is that its output mixes "config" information (the cluster topology, in this case) with "state" information, such as replication offset and health. So maybe another word for "deterministic" is "topology", which better reflects the requirement here. "config" would work for me too.

"deterministic", on the other hand, is a bit subjective and leaves room for different interpretations, IMO.

madolson added the major-decision-pending label (Major decision pending by TSC team) Apr 14, 2024
zuiderkwast (Contributor) commented Apr 15, 2024

@madolson Why is this a major decision? Sorry, I wasn't reading properly.

Sorting the nodes can't be a breaking change (similar to what @hpatro did for CLUSTER SLOTS in #265).

Regarding filtering arguments, I'm not so sure... As for hashing the output without parsing, I think that's one of the reasons CLUSTER NODES returns a string rather than RESP-structured data.

For a structured RESP reply like this, a client needs to parse the reply just to know where it ends; you'd need a very special parser to find the length of the response without parsing it. So I think it'd be fine for clients to get the reply, remove some fields (which we can document), and then compute a hash of the rest.

PingXie (Member) commented Apr 15, 2024

> So I think it'd be fine for clients to get the reply, remove some fields (which we can document), and then compute a hash of the rest.

I think the proposal is to have the server remove volatile fields, as opposed to clients doing the filtering.

zuiderkwast (Contributor) commented:

@PingXie I know that's the suggestion, but adding filter arguments to the command is a larger change than documenting what clients need to do if they want to compute this kind of hash.

PingXie (Member) commented Apr 15, 2024

I'd think a binary filter would be acceptable, something like "topology" or "all"; there is a straightforward story there, IMO. On fine-grained filtering being too much, agreed.

PingXie moved this to Todo in Valkey 8.0 Apr 15, 2024
daniel-house (Member) commented:

The problem seems to be:

> Currently, some clients rely on a single-node perspective to obtain cluster topology, which may not accurately reflect the entire cluster's state.

The rest of the initial post appears to be offering ways to make it easier for a client to get an accurate picture of the entire cluster's state.

I think this problem may be theoretically unsolvable. If so, let us seek a sufficiently useful approximate picture.

Suppose we had some report from each node that was deterministically comparable, as requested by Bar. (Note: there is nothing subjective about the word deterministic. As Bar wrote: nodes with identical views return identical outputs. However, the choice of deterministic output is subjective as it depends on how we will use it.) Would it be sufficient to find a quorum that is reporting exactly the same state? How big would the quorum have to be? Can we determine if the cluster's topology is in flux due to adding or removing nodes, slot-migration or fail-over, so that we might need to try again later?

What is the ultimate use case? Why do we need an accurate picture of the entire cluster's state? Are we presenting a dashboard to the admins? Are we trying to update the client's slot-map for following moved keys? Are we building a replacement for Sentinel? Do we want to give advice for re-sharding? All of the above?

Would the report be more useful for more use cases if the server did some filtering? Would it be more useful if it were trivial to parse using easily available libraries (e.g., JSON)?

Should it be one or more new CLUSTER subcommands such as STATE, TOPOLOGY, or HEALTH?

daniel-house (Member) commented:

There was some discussion of cluster V2 in #58.

Could someone link information about cluster V2 into this issue? I don't know anything about it except what I gathered from issue 58, but it seems to me that cluster V2 might provide a better solution for the use case that motivates this discussion of CLUSTER SHARDS.

PingXie (Member) commented Apr 17, 2024

> What is the ultimate use case?

Cluster v1 is eventually consistent (if it converges at all). Whenever there is a topology change, such as adding replicas, scaling out/re-sharding, or performing rolling upgrades, there needs to be a way for the client to detect that the cluster topology has quiesced before it can safely update its view of the slot->node mapping. The signal that @barshaul is looking for here is that "a majority of nodes agree on the cluster topology", as reported by CLUSTER SHARDS.
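
As a rough illustration of that signal (assuming per-node view hashes computed as in the normalization sketch under the original proposal), the client-side majority check could be as simple as:

```python
from collections import Counter

def topology_has_quorum(view_hashes, threshold=0.5):
    """Given one view hash per queried node, report whether a strict
    majority of the reachable nodes agree on the cluster topology."""
    if not view_hashes:
        return False
    _, count = Counter(view_hashes).most_common(1)[0]
    return count / len(view_hashes) > threshold

# Example: three of four nodes agree, so the client can trust the view.
assert topology_has_quorum(["h1", "h1", "h1", "h2"])
```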

> Why do we need an accurate picture of the entire cluster's state?

Request routing is performed by the client based on the key's hash. The client computes the CRC16 of the key in question to figure out which (logical) slot it belongs to. However, to figure out which node hosts that slot, the client relies on the slot->node mapping, which is obtained from one of the cluster topology query commands such as CLUSTER SHARDS.
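
For reference, that slot computation is HASH_SLOT = CRC16(key) mod 16384, using the CRC16-CCITT (XMODEM) variant, with hash tags ({...}) restricting hashing to the tagged substring. A self-contained sketch:

```python
def crc16_xmodem(data: bytes) -> int:
    # CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    # If the key contains a non-empty {...} section, hash only that part,
    # so related keys can be forced into the same slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384

assert key_hash_slot(b"{user:1000}.name") == key_hash_slot(b"user:1000")
```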

> Are we presenting a dashboard to the admins?

Not the primary use case.

> Are we trying to update the client's slot-map for following moved keys?

It is a type of cluster topology change.

> Are we building a replacement for Sentinel? Do we want to give advice for re-sharding? All of the above?

Orthogonal to Sentinel. This is all about clusters.

> Would the report be more useful for more use cases if the server did some filtering?

Server filtering vs. client filtering boils down to implement-it-once (on the server side) vs. implement-it-many-times (in every interested client).

> Would it be more useful if it were trivial to parse using easily available libraries (e.g., JSON)?

That would be a breaking change.

daniel-house (Member) commented:

I would never suggest replacing what we have today with JSON because that would indeed be a breaking change. Would allowing the client to request JSON instead of the current format be better than having the client specify a filter?

Please help me continue to pin down the requirements. At this point they appear to be to provide a way for the client to

  1. detect that the cluster topology is not in flux, or
  2. determine that a majority (quorum = greater than 50%) of nodes agree on the cluster topology, or
  3. determine the most consistent representation of the current topology of the cluster, or
  4. all of the above.

In all of these cases I think we can do much better by providing a new CLUSTER subcommand.

For example, suppose the goal is (2), determining that we have a quorum (the percentage might be a variable). We could make it much easier to wait for a quorum by associating a unique, non-decreasing ID with each new topology. Then the client only needs to poll until a quorum reports the same maximal ID (see the sketch below). Note: I have an intense dislike for polling, and would look for a better solution.
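
To make this concrete, here is a purely hypothetical sketch (no such topology ID exists in CLUSTER SHARDS today, and fetch_ids stands in for however the client would query each node):

```python
import time

def wait_for_topology_quorum(fetch_ids, quorum=0.5, timeout=10.0, interval=0.5):
    """Poll until more than `quorum` of the reachable nodes report the
    same maximal topology ID. `fetch_ids` returns one ID per node."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ids = fetch_ids()
        if ids:
            max_id = max(ids)
            if ids.count(max_id) / len(ids) > quorum:
                return max_id
        time.sleep(interval)
    raise TimeoutError("cluster topology did not converge in time")
```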

I can think of lots of possible solutions, but I'd rather wait until we have a solid agreement on the goal.

PingXie (Member) commented Apr 18, 2024

Whether it's JSON or not is not at the core of this problem. The output of the existing CLUSTER SHARDS is already conceptually JSON. The content is what is actually being discussed in this thread.

> associating a unique, non-decreasing ID with each new topology

Can you elaborate on how this ID would be maintained? There is the epoch, which has this monotonically increasing property, but it is a per-shard concept and it is not reported by CLUSTER SHARDS.

> I can think of lots of possible solutions, but I'd rather wait until we have a solid agreement on the goal.

Yes, and I would say nos. 2 and 3 are my understanding of the goal. That said, these are unfortunately band-aid requirements/solutions for cluster v1, IMO. I say we really need strong consistency for the topology (#58 (comment)).

barshaul (Contributor, Author) commented Aug 5, 2024

Hey @madolson @PingXie,
Are we moving forward with #411? Do we have an estimated release version for it?

madolson (Member) commented Aug 5, 2024

@barshaul There is no active plan, but we moved the discussion about accepting it to #517. I think we weren't able to decide between tweaking cluster slots and modifying cluster shards.

PingXie (Member) commented Aug 12, 2024

I don't see a clear winner between the two, either.

Now that we have reverted the decision to deprecate cluster slots and are open to iterating on it further, I think we have created a situation of redundant commands. In theory, if we are open to introducing more flavors of cluster slots, we could just merge cluster shards into cluster slots as a "dialect".

We might need to eventually implement this "topology query" support in both commands, but for now, I would go with whichever one is easier for clients to adopt.

zuiderkwast linked a pull request Aug 12, 2024 that will close this issue
zuiderkwast (Contributor) commented:

We should evaluate the use cases for each of these commands.

I think we should continue to prefer CLUSTER SLOTS for clients, optionally with the new dense version for clients that support it. We already added caching for the CLUSTER SLOTS responses. AFAICT, no clients have started using CLUSTER SHARDS yet anyway.

We can think of CLUSTER SHARDS as an admin command, a structured variant of CLUSTER NODES. The #411 PR adds some complexity and has a negative performance impact. Maybe it doesn't need to be deterministic, since the caller can sort and filter the result, so I'm not sure #411 is necessary. I don't think we should add caching of the CLUSTER SHARDS reply either.

madolson added this to Roadmap Aug 28, 2024
madolson moved this to Researching in Roadmap Aug 28, 2024
madolson removed this from Valkey 8.0 Aug 28, 2024