Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should functions be synchronized among cluster nodes? #58

Open
CharlesChen888 opened this issue Mar 28, 2024 · 24 comments
Open

Should functions be synchronized among cluster nodes? #58

CharlesChen888 opened this issue Mar 28, 2024 · 24 comments
Labels

Comments

@CharlesChen888
Copy link
Member

This issue was first raised in Redis community redis/redis#11780, it is a major miss from Redis 7.0, but haven't got a solution. I believe we still need to discuss this issue in this new project, perhaps as a part of new cluster architecture.

The problem:

Let's say:

  • a slot, which includes key1, is migrated from node A to node B, operated by client1
  • in node A a function myfunc is already loaded, but in node B it is not

Now client2, that used to handle key1 in node A, sends FCALL myfunc 1 key1 to node A, which will be redirected to node B, will then get an unexpected error.
Or if node B also has a function myfunc but it does different things, client2 may trigger unexpected results.

Some thoughts:

1.Should functions be migrated together with slots?
Maybe not. This may overwrite the libraries already loaded in the nodes importing slots. But we may set a version to each library so that we can always choose to keep the latest one.

2.Can functions be broadcasted to all nodes?
May cause confict between nodes. And when adding new nodes we need to have a full synchronization.

@mattsta mattsta mentioned this issue Mar 28, 2024
10 tasks
@hpatro
Copy link
Contributor

hpatro commented Mar 28, 2024

1.Should functions be migrated together with slots?

I don't think slots should be associated with functions. functions should be treated as a resource at node level. On each node addition, all functions existing in the cluster/standalone primary should be applied.

Overall, I think we should treat functions, ACLs and config(s) as resource(s) which should be applied to all the nodes and build it in the cluster revamp which we have been talking for a while.

@zuiderkwast
Copy link
Contributor

zuiderkwast commented Apr 10, 2024

IIRC it was decided that functions are similar to configuration and that it's the admin's responsibility to define functions on all nodes. I don't know if it's a good decision.

The official admin tool is valkey-cli, so when managing a cluster (creating a cluster, adding nodes, etc.) perhaps valkey-cli should try to synchronous functions across nodes? That would at least match the current design decisions.

It's the same problem with ACL and other config in a cluster: They are expected to be the same throughout the cluster but the cluster doesn't guarantee this.

I'm not sure I believe in a complete rewrite of the cluster bus. Usually such projects are never finished. Can we try gossipping this information instead?

@madolson
Copy link
Member

IIRC it was decided that functions are similar to configuration and that it's the admin's responsibility to define functions on all nodes. I don't know if it's a good decision.

Zhao can keep me honest, but this is the only time there was a real 3-2 decision amongst the old maintainers. (It was a more broad discussion on functions, but basically they said this is what Redis wanted).

I generally want to implement a new system that is distributed configuration, that supports functions, ACLS, configs and allows modules to also store distributed configuration. This can be built ontop of the existing clusterbus extensions, but will be in its best form when built with the cluster v2. (I see Viktor is dubious, but Ping and I are highly motivated to make this happen now)

@zuiderkwast
Copy link
Contributor

I'm motivated, but I also realize we have a lot of things going on, and I want us to make releases more often. Can we make it a focus area for 9.0? (That means start prototyping now.)

@madolson
Copy link
Member

I'm motivated, but I also realize we have a lot of things going on, and I want us to make releases more often. Can we make it a focus area for 9.0? (That means start prototyping now.)

My vision, is that we finish the needed refactoring work to make cluster a module for valkey 8. Then it becomes easy to iterate and prototype in a separate repo the new clustering work. Then someone can "opt-in" to the beta of it while Valkey 8 is out and about, and we can finalize it for 9.

@zuiderkwast
Copy link
Contributor

I think we shall start using some of the reserved bits in the clusterMsg to indicate support for new cluster bus messages. From there, we can add new message types and turn the cluster bus more into a proper raft cluster, step by step.

Easier to test, easier to get things right, if we do things in smaller steps.

@PingXie
Copy link
Member

PingXie commented Apr 16, 2024

Timeline aside, this is actually a problem that can ONLY be properly solved by cluster V2 because it gives you a strongly consistent (meta)data service that manages cluster-wide metadata like cluster topology, node configs, ACLs, and Functions, etc

Cluster V1 being eventually consistent (which is not 100% true by the way) and the coupling of data serving logic and topology management logic on the same node are the source of these oddities, IMO. I also think implementing a true raft cluster inside of the Valkey server is no less work than cluster V2 itself. The coupling shouldn't be there in the first place.

I am in favor of moving the entire cluster out of the engine and into its own module and this needs to be the precursor to the real cluster V2 work. That said, I don't have a clear idea on the complexity involved so I don't know if it can happen in Valkey 8, yet. Also agreed that building cluster V2 into the engine is a non-starter.

@zuiderkwast
Copy link
Contributor

zuiderkwast commented Apr 16, 2024

It's very good that we're having this discussion. We need to have all the information and aspects on the table to be able to make informed decisions.

Timeline aside

Beautiful, but I don't think we shall ignore time.

Just adding module APIs normally involves a lot of work just in terms of API design, testing and documentation. This makes adding APIs much more work than adding internal functions.

Which module API functions do we need to add and what will these APIs look like? Will the cluster module keep all the cluster state in its own memory, intercept commands and sometimes return redirects for them?

cluster V2 because it gives you a strongly consistent (meta)data service that manages cluster-wide metadata like cluster topology, node configs, ACLs, and Functions, etc

So we need to have module APIs for all of those, including some event mechanism so if a user or any other module changes any of these, so the cluster module can get these changes. (Not a very small and easy job.)

The coupling shouldn't be there in the first place.
Also agreed that building cluster V2 into the engine is a non-starter.

Who agreed on this? If we're using a raft library, we avoid that coupling. A node can still be a cluster node without serving any data. Agreed that moving things to a module is a guarantee when it comes to separation of concerns, but it is also much more work.

@daniel-house
Copy link
Member

Can anyone suggest a use case where we would actually prefer to have different implementations of the same function on different nodes? If not, then we have the answer to the question raised by this issue.

@zuiderkwast
Copy link
Contributor

@daniel-house You're right, it's better that they are synchronized. The open question is how.

Can you suggest how to implement it within the current cluster solution that makes sure they are synchronized on all nodes?

@daniel-house
Copy link
Member

OK, we agree that they should be synchronized.

I can't say I understand the current cluster solution, but I can see that we are having problems synchronizing hostnames (#304), so I have serious doubts that we can synchronize functions at this time. It seems to me that a good solution would also solve the hostnames issue and the problems raised in #114 . If anyone can point me to useful reading about the current cluster solution, please do.

I would appreciate any pointers to more information about cluster V2. I get the general impression that it will use Raft leader election, and would be perfect for this, hostnames, and the ultimate goals of #114 . If so, I would suggest punting all of these issues until after cluster V2 is implemented and assigning a very high priority to cluster V2.

@zuiderkwast
Copy link
Contributor

If anyone can point me to useful reading about the current cluster solution, please do.

@zuiderkwast
Copy link
Contributor

If we want to do something like the Redis cluster v2, then I believe we can have cluster V2 in a few years, say Valkey 10 or 11, maybe not even that. Rewriting something from scratch that can't be used until everything is in place is very likely to never be launched IMO. That's why I think Redis will fail. I hope we will choose a different, incremental approach. It's not that easy but I believe it's necessary. (We haven't really discussed it enough in the Valkey core team.)

@daniel-house
Copy link
Member

If anyone can point me to useful reading about the current cluster solution, please do.

Thanks. I've looked at those in the past, and again today. They are indeed useful. (1) Are the docs up to date? (2) The docs seem too superficial for this purpose (proposing an extension to synchronize functions). Searching those docs for host doesn't lead to information about a hostname extension as described in #304 . I was hoping for the design or at least the discussion/issue and PR that lead to #304 .

The information about a gossip protocol is helpful, but which one?

  • The source code

Ha, ha. Very funny. And, sadly, all I really expected to find.

According to the cluster-spec:

For every node added in the gossip section the following fields are reported:

Node ID.
IP and port of the node.
Node flags.

Why didn't we add the hostname here instead of ping/pong?

I'll read some more from those docs and the code.

@daniel-house
Copy link
Member

If we want to do something like the Redis cluster v2, then I believe we can have cluster V2 in a few years, say Valkey 10 or 11, maybe not even that. Rewriting something from scratch that can't be used until everything is in place is very likely to never be launched IMO. That's why I think Redis will fail. I hope we will choose a different, incremental approach. It's not that easy but I believe it's necessary. (We haven't really discussed it enough in the Valkey core team.)

This link is something I really needed. Thankyou.

So, you think cluster V2 will never happen. Is that the consensus? I will read it in detail. Even if it is far in the future it will help me understand today's pain points. Are there any places I should subscribe to so that I can follow any discussions of cluster V2?

How about this incremental approach? Link in a high quality implementation of the Raft protocol. Define a state-machine to manage with the Raft protocol. Only put one thing in it, a list of node-IDs with hostnames. After that works, add function definitions to the state-machine. We should then be able to incrementally extend the state-machine to cover everything needed for #114

Alternatively, use CRDT registers (simple strings) to follow the same incremental path.

@zuiderkwast
Copy link
Contributor

So, you think cluster V2 will never happen.

That's not what I said. (I said that if we want to do it in a certain non-incremental way, then it will be difficult.)

Is that the consensus?

No.

@PingXie
Copy link
Member

PingXie commented Apr 17, 2024

I don't think we should reference back to the old cluster V2 design here in the Valkey forum. That work was started by Redis. We should take a fresh look at the problems again and start from scratch. The only connection is just the name "cluster V2". Please disregard the "Redis Cluster V2" design completely.

@daniel-house
Copy link
Member

@PingXie Works for me. Is there a list of problems/requirements?

@hpatro
Copy link
Contributor

hpatro commented Apr 17, 2024

@PingXie Works for me. Is there a list of problems/requirements?

Few of the problems/requirements:

  • Support higher Scaling (Larger cluster size) - Currently Valkey cluster can't scale beyond 500 nodes beyond which the gossip protocol doesn't scale well. The new architecture should strive to handle much larger cluster size.
  • Support smaller cluster size - With the current design, only primary nodes can vote hence the ideal cluster would require atleast 3 primaries to form a quorum. The new design should take into consideration where each node (primary/replica) should be part of the quorum.
  • Centralized metadata store - Currently config, functions, acl needs to be applied on each node. This is painful and cumbersome during cluster setup as well as during scale out. Building a mechanism to be able to send those information once and being applied to all of the nodes would be ideal.
  • Decouple control and data transfer via cluster bus - Currently, the clusterbus carries both control data (health info, gossip info, etc) as well as pubsub data. With high pubsub data, the clusterbus can get overwhelmed and cause issue with the cluster health status and can delay cluster topology information consistency in the system.
  • Lower failover time - Due to the primary node only consensus mechanism in the current approach, failover in a shard can be delayed if there is a network partition and other primaries aren't reachable.

@PingXie
Copy link
Member

PingXie commented Apr 18, 2024

Thanks @hpatro.

Here are the problems that I think we need to solve in the current cluster:

  1. strong consistency (for cluster topology)

cluster topology is concerned with which nodes own which slots and primaryship. The current cluster implementation is not even eventually consistent by design because there are places where node epochs are bumped without consensus (trade-offs). This leads to increased complexity on the client side.

  1. better manageability (of global config/data)

This particular issue provides the exact context on this pain point

  1. more resilience (to stressful client workload)

Today, both the cluster bus and the client workload run on the same main thread. So a demanding client workload has the potential to starve the cluster bus and leads to unnecessary failover.

  1. higher scale

The V1 cluster is a mesh so the cluster gossip traffic is proportional to N^2, where N is the (data) nodes in the cluster. The practical limit of a V1 cluster is ~500 nodes.

@daniel-house
Copy link
Member

I don't see how you would perform a rolling upgrade to a cluster that meets these requirements. Is that acceptable?

@PingXie
Copy link
Member

PingXie commented Apr 25, 2024

I don't see how you would perform a rolling upgrade to a cluster that meets these requirements. Is that acceptable?

We haven't finalized the v1 to v2 upgrade migration path, including whether such a need exists or not. A seamless migration path is always great but I can also see the two cluster implementations continue to evolve on their own for a long time. So in this sense, v2 is not an accurate moniker, strictly speaking.

@zuiderkwast
Copy link
Contributor

We haven't finalized almost anything about v2. :)

I think we need many different threads to discuss all these aspects. Shall we use a wiki for that? Or just many issues and sub-issues to discuss?

I think we should really explore the possibilities for a running cluster to auto-upgrade to a new consensus mode. For example, the node with the highest epoch can decide to switch to a raft algorithm, when all known nodes are known to support it, and start sending out new messages on a new format. When such message reach the other nodes, they switch too. We can use the existing TCP connections. (We could use UDP but it's harder to get DTLS working (OpenSSL doesn't even have DTLS 1.3 yet) than to just use TCP and TLS. But that's a different discussion.)

This sketch ↑ is just to reason that it's theoretically possible. Then we can weigh how important it is, compared to safety, simplicity, separation and other aspects.

@PingXie
Copy link
Member

PingXie commented Apr 25, 2024

We haven't finalized almost anything about v2. :)

True but my statement is correct too :-)

I think we need many different threads to discuss all these aspects. Shall we use a wiki for that? Or just many issues and sub-issues to discuss?

Yeah, I will create a dedicated issue for cluster v2. Let's start from the pain points in today's cluster design/implementation. It is important to have a consensus on the problem definition first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants