
Only (re-)send MEET packet once every handshake timeout period #1441

Merged
merged 9 commits into valkey-io:unstable from cluster-handshake-fix-followup on Dec 30, 2024

Conversation

@pieturin (Contributor) commented Dec 13, 2024

Add meet_sent field in clusterNode indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period when detecting a node without an inbound link.

When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as with any other packet.

Improve some logging messages to include human_nodename. Add nodeExceedsHandshakeTimeout() function.
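
As an illustration of the re-send gating, the clusterCron() side could look roughly like this (a minimal sketch built from the fields and helpers named in this PR and in the discussion below; the exact merged condition may differ):

```c
/* Sketch only: when a node has an outbound link but its inbound link has been
 * missing for longer than the handshake timeout, (re-)send a MEET, but at most
 * once per handshake timeout period. */
mstime_t now = mstime();
if (node->link != NULL && node->inbound_link == NULL &&
    now - node->inbound_link_freed_time > getHandshakeTimeout() &&
    now - node->meet_sent > getHandshakeTimeout()) {
    node->flags |= CLUSTER_NODE_MEET; /* next ping to this node goes out as a MEET */
    node->meet_sent = now;            /* assumption: could also be recorded where the MEET is queued */
}
```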

This is a follow-up to this previous PR: #1307
And a partial fix to the crash described in: #1436

@pieturin force-pushed the cluster-handshake-fix-followup branch 2 times, most recently from 629bfaa to e9b1b3c on December 13, 2024 23:53
@pieturin (Contributor Author)

@enjoy-binbin, @PingXie, please take a look at this PR when you have the chance.

@enjoy-binbin (Member) left a comment

LGTM. I also thought about limiting it, but I thought we should be more robust against duplicate packets.

Add meet_sent field in clusterNode indicating the last time we sent a
MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period.
Improve some logging messages to include human_nodename.
Add nodeExceedsHandshakeTimeout() function.

Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.

Signed-off-by: Pierre Turin <[email protected]>
@pieturin force-pushed the cluster-handshake-fix-followup branch from e9b1b3c to 4f82271 on December 17, 2024 00:06
@pieturin (Contributor Author)

In the last revision of this PR, I removed the code to drop the MEET packet from #1436 and replaced it with a check that prevents creating a new node when receiving several MEET packets on the same link while the node is in handshake.

@enjoy-binbin, @madolson, @PingXie, please tell me if you prefer this over dropping the packet entirely.
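
For reference, a sketch of what that check could look like (assumed shape, not the exact diff; nodeInHandshake(), createClusterNode() and clusterAddNode() are existing helpers in cluster_legacy.c):

```c
/* Sketch only: a MEET from an unknown sender normally creates a node in
 * handshake state, but if this link's node is already handshaking we skip the
 * creation and keep processing the packet (gossip section and PONG reply). */
if (type == CLUSTERMSG_TYPE_MEET && sender == NULL) {
    if (link->node == NULL || !nodeInHandshake(link->node)) {
        clusterNode *node = createClusterNode(NULL, CLUSTER_NODE_HANDSHAKE);
        /* ... resolve ip/port from the link and header, then clusterAddNode(node) ... */
    }
    /* else: duplicate MEET on a handshaking link; do not create another node. */
}
```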

@PingXie (Member) left a comment

LGTM overall.

I'd prefer not dropping the second MEET. Not because the dropping itself is a concern, but because the fix is a bit ad hoc. I feel that it is just a valid case and we should just handle it.

Another comment is on the new inbound_link_freed_time field. I feel that it is not created at the right level (of abstraction). It should not be a concern of freeClusterLink. I actually like the new meet_sent more. What do you think?

@enjoy-binbin (Member)

When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets.

OK, this LGTM. It was actually my first attempt; back then I mistakenly thought that a repeated MEET would eliminate the handshake state, but now that I look at the code I think it is safe.

@pieturin (Contributor Author)

Another comment is on the new inbound_link_freed_time field. I feel that it is not created at the right level (of abstraction). It should not be a concern of freeClusterLink. I actually like the new meet_sent more. What do you think?

I don't think we can get rid of inbound_link_freed_time and only use meet_sent. We don't want to send a MEET packet every time the inbound connection is dropped. We should wait for the timeout period to first make sure the other side hasn't tried to re-connect with us.

Why do you think it should not be a concern of freeClusterLink()? This function already updates the node's link or inbound_link, so it seems natural for it to also update inbound_link_freed_time.
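
As a rough sketch of that (assumed shape only, not the exact patch):

```c
/* Inside freeClusterLink(), sketch only: when an inbound link is torn down,
 * remember when it happened so clusterCron() can wait a full handshake
 * timeout before deciding the peer has really lost track of us. */
if (link->node) {
    if (link->node->inbound_link == link) {
        link->node->inbound_link = NULL;
        link->node->inbound_link_freed_time = mstime();
    } else if (link->node->link == link) {
        link->node->link = NULL;
    }
}
```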

Address PR comments.

Signed-off-by: Pierre Turin <[email protected]>
Node 1 might not be known to Node 0 if Node 2 does not gossip about it
to Node 1.

Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 96.87500% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.87%. Comparing base (980a801) to head (a542eb5).
Report is 24 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 96.87% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1441      +/-   ##
============================================
+ Coverage     70.78%   70.87%   +0.08%     
============================================
  Files           119      119              
  Lines         64691    64704      +13     
============================================
+ Hits          45790    45856      +66     
+ Misses        18901    18848      -53     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.73% <96.87%> (-0.13%) ⬇️

... and 27 files with indirect coverage changes

@madolson (Member)

When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets.

Btw, I also agree with this. I mentioned on the other thread that I think this is the better behavior, but people were seeing a lot of CI failures, so I was okay with a temporary solution.

@pieturin (Contributor Author)

The code coverage report shows one missing line:

debugServerAssert(link->inbound && nodeInHandshake(link->node));

This line should not be reached anymore since we are waiting for the handshake timeout before re-sending a MEET packet.

@PingXie (Member) commented Dec 18, 2024

Another comment is on the new inbound_link_freed_time field. I feel that it is not created at the right level (of abstraction). It should not be a concern of freeClusterLink. I actually like the new meet_sent more. What do you think?

I don't think we can get rid of inbound_link_freed_time and only use meet_sent. We don't want to send a MEET packet every time the inbound connection is dropped. We should wait for the timeout period to first make sure the other side hasn't tried to re-connect with us.

Why do you think it should not be a concern of freeClusterLink()? This function already updates the node's link or inbound_link, so it seems natural for it to also update inbound_link_freed_time.

freeClusterLink() is invoked in various scenarios, but the new state we're introducing is only relevant in the handshake path. This suggests that the scope of the new state is broader than necessary.

On second thought, I think a better name would be handshake_start_time. This name better reflects its purpose.

  • On the initiator side, handshake_start_time would be set whenever a MEET command is sent.
  • On the recipient side, it would be set when the first PING message is sent on the outgoing link for a node in the HANDSHAKE state.

@pieturin (Contributor Author)

On second thought, I think a better name would be handshake_start_time. This name better reflects its purpose.

I still think having this is not enough. We should still have inbound_link_freed_time to know whether the link was freed recently, in case it's just a temporary reconnect between the two nodes. We don't want to send a MEET packet when the handshake happened days ago and the inbound link is NULL for only a fraction of a second. Does this make sense?
And I can't use ping_sent/pong_received/data_received since those are also updated for the outbound link.

freeClusterLink() is invoked in various scenarios, but the new state we're introducing is only relevant in the handshake path. This suggests that the scope of the new state is broader than necessary.

But we need to update inbound_link_freed_time every time the inbound link is freed in order to make sure we don't send the MEET packet on a short disconnect.

I guess a way to get rid of inbound_link_freed_time would be to replace it with another field that would track if the inbound_link was ever opened on a node, something like was_inbound_link_set.

With this, we could replace the condition with something like:

    if (nodeInNormalState(node) && node->link != NULL && node->inbound_link == NULL &&
        nodeExceedsHandshakeTimeout(node) && now - node->meet_sent > getHandshakeTimeout() &&
        !node->was_inbound_link_set) {

What do you think @PingXie?

@pieturin (Contributor Author)

I had a talk offline with @PingXie, and we think we can change the condition to send a MEET packet to something like:

    if (nodeInNormalState(node) && node->link != NULL && node->inbound_link == NULL &&
        nodeExceedsHandshakeTimeout(node) &&
        now - node->handshake_start_time > getHandshakeTimeout() && now - node->handshake_start_time < getHandshakeTimeout() * 2) {

And with that we should be able to get rid of inbound_link_freed_time.

@pieturin (Contributor Author)

A crash was detected recently during the daily runs: https://github.com/valkey-io/valkey/actions/runs/12403569355/job/34627291048#step:8:3854

17847:M 19 Dec 2024 00:15:58.007 - Accepting cluster node connection from 127.0.0.1:45768
17847:M 19 Dec 2024 00:15:58.021 * Error converting peer IP to string: (null)


=== VALKEY BUG REPORT START: Cut & paste starting from here ===
17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED ===
17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true

------ STACK TRACE ------

17847 valkey-server *
src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634]
src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de]
/__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e]
src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea]
src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547]
/lib64/libc.so.6(+0x40c8) [0x7f083985a0c8]
/lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b]
src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5]

...

# Cluster info
cluster_state:fail
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:2
cluster_size:0
cluster_current_epoch:0
cluster_my_epoch:0
cluster_stats_messages_ping_sent:1
cluster_stats_messages_pong_sent:1
cluster_stats_messages_sent:2
cluster_stats_messages_meet_received:2
cluster_stats_messages_received:2
total_cluster_links_buffer_limit_exceeded:0

------ CLUSTER NODES OUTPUT ------
de11af5df0d8b71a5df786c7a1e7ab0a0beab7fc 127.0.0.1:27128@37127,,tls-port=27127,shard-id=2ea5410be1f7dcac4755a6c9bcc60558e3f4f14b handshake - 1734567357566 0 0 connected
26d986e4e07f0d2c768d62dfdaf9f0ae03d18e49 127.0.0.1:27131@27129,,tls-port=27130,shard-id=236191ce869481919bb106a49cb235538b419178 myself,master - 0 0 0 connected

My proposal after talking with @madolson is to remove this assert and instead just remove the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed.

serverAssert(nodeIp2String(node->ip, link, hdr->myip) == C_OK);
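
A minimal sketch of that fallback (hypothetical shape, not the exact merged change):

```c
/* Sketch only: bail out instead of asserting when the peer IP cannot be
 * resolved from the link, which usually means the connection is already gone. */
if (nodeIp2String(node->ip, link, hdr->myip) != C_OK) {
    serverLog(LL_NOTICE, "Cannot resolve the peer IP, closing the link and dropping the handshake node.");
    freeClusterNode(node); /* the node was just created for this MEET and has no usable address */
    freeClusterLink(link);
    return 0; /* assumed convention: 0 tells the caller the link was freed */
}
```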

@pieturin (Contributor Author)

Doing the solution with handshake_start_time doesn't work with the existing tests. During one of the cluster-reliable-meet tests, I see a node sending a MEET packet just because of a temporary disconnect: the other node frees the link and reconnects because it didn't receive any PONG for more than server.cluster_node_timeout / 2. The cluster-reliable-meet test then fails because the node with the inconsistent view never sends the MEET (I'm not sure why yet). So I don't think it's worth trying to make this work; it seems more finicky than relying on inbound_link_freed_time.

When we cannot get the IP address from a link, instead of asserting we
now free the link and delete the node.

Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
@PingXie (Member) left a comment

Some minor wording changes in the code comments and logs. Thanks @pieturin!

@PingXie (Member) commented Dec 22, 2024

And then the cluster-reliable-meet test fails because the node with the inconsistent view never sends the MEET (not sure why yet though)

I think I know why. The handshake_start_time is lost in this case along with the node that is created as a response to the first MEET. The node is then recreated via the gossip route, but this new node never initiates a handshake, so its handshake_start_time is never set. Here is the possible sequence:

  1. A cluster with two primaries, P1 and P2.
  2. A new node N is to be added.
  3. P1 is instructed to meet N, but P1 is also told to drop all cluster traffic from N. Therefore, from N's perspective, P1 remains in the handshake state.
  4. Eventually, P1's representation on N, a clusterNode instance, times out and gets deleted.
  5. However, when P1 sends out its first MEET, it also gossips about P2 to N, which shows up as a normal node on N.
  6. P2 then gossips P1 back to N, now as a normal node! This is why handshake_start_time is never initialized for this new version of P1's representation on N.

I think this is just another argument for #1471.

@hwware merged commit e4179f1 into valkey-io:unstable on Dec 30, 2024
50 checks passed
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Jan 9, 2025
In valkey-io#1441, we found an assert and decided to remove it and instead just free the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed.
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED ===
17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true

------ STACK TRACE ------

17847 valkey-server *
src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634]
src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de]
/__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e]
src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea]
src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547]
/lib64/libc.so.6(+0x40c8) [0x7f083985a0c8]
/lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b]
src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5]
```

But it also introduces another assertion failure. The reason is that this new node is not added to the cluster nodes dict, so we should just free it.
```
17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED ===
17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true

------ STACK TRACE ------

17128 valkey-server *
src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4]
src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2]
src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618]
/__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278]
src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09]
src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5]
src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e]
```

This closes valkey-io#1527.

Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request Jan 10, 2025
@pieturin deleted the cluster-handshake-fix-followup branch on January 14, 2025 22:11
enjoy-binbin added a commit that referenced this pull request Jan 17, 2025
Update comments and log message in `cluster_legacy.c`.

Follow-up from #1441.

Signed-off-by: Pierre Turin <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Co-authored-by: Binbin <[email protected]>
proost pushed a commit to proost/valkey that referenced this pull request Jan 17, 2025
proost pushed a commit to proost/valkey that referenced this pull request Jan 17, 2025
Update comments and log message in `cluster_legacy.c`.
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
Update comments and log message in `cluster_legacy.c`.
hpatro pushed a commit to hpatro/valkey that referenced this pull request Feb 1, 2025
Update comments and log message in `cluster_legacy.c`.
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Feb 2, 2025
Update comments and log message in `cluster_legacy.c`.