-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only (re-)send MEET packet once every handshake timeout period #1441
Only (re-)send MEET packet once every handshake timeout period #1441
Conversation
629bfaa
to
e9b1b3c
Compare
@enjoy-binbin, @PingXie, please take a look at this PR when you have the chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I also thought about limiting it, but I thought we should be more robust to double packets.
Add meet_sent field in clusterNode indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period. Improve some logging messages to include human_nodename. Add nodeExceedsHandshakeTimeout() function. Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets. Signed-off-by: Pierre Turin <[email protected]>
e9b1b3c
to
4f82271
Compare
In the last revision of this PR, I removed the code to drop the MEET packet from #1436 and replaced it with a check to prevent from creating a new node when receiving several MEET packets on the same link while the node is in handshake. @enjoy-binbin, @madolson, @PingXie, please tell me if you prefer this over dropping the packet entirely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall.
I'd prefer not dropping the second MEET. Not because the dropping itself is a concern but the fix is a bit ad-hoc. I feel that it is just a valid case and we should just handle it.
Another comment is on the new inbound_link_freed_time
field. I feel that it is not created at the right level (of abstraction). It should not be a concern of freeClusterLink
. I actually like the new meet_sent
more. What do you think?
ok, this LGTM and it was actually my first attempt, back then i mistakenly thought that repeated MEET would eliminate the handshake state, but now that I look at the code I think it is safe. |
I don't think we can get rid of Why do you think it should not be a concern of |
Address PR comments. Signed-off-by: Pierre Turin <[email protected]>
Node 1 might not be known from Node 0 if Node 2 does not gossip about it to Node 1. Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1441 +/- ##
============================================
+ Coverage 70.78% 70.87% +0.08%
============================================
Files 119 119
Lines 64691 64704 +13
============================================
+ Hits 45790 45856 +66
+ Misses 18901 18848 -53
|
Btw, I also agree with this. I mentioned on the other thread I think this is the better behavior, but people were seeing a lot of CI failures so I was okay with a temporary solution. |
The code coverage reports one missing line: Line 3253 in 0ccc5cc
This line should not be reached anymore since we are waiting for the handshake timeout before re-sending a MEET packet. |
On a second thought, I think a better name would be
|
I still think having this is not enough. We should still have
But we need to update I guess a way to get rid of With this, we could replace the condition with something like:
What do you think @PingXie? |
I had a talk offline with @PingXie, and we think we can change the condition to send a MEET packet to something like:
And with that we should be able to get rid of |
A crash was detected recently during the daily runs: https://github.com/valkey-io/valkey/actions/runs/12403569355/job/34627291048#step:8:3854
My proposal after talking with @madolson is to remove this assert and instead just remove the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed. Line 3274 in 1c97317
|
Doing the solution with |
When we cannot get the IP address from a link, instead of asserting we now free the link and delete the node. Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor wording changes in the code comments and logs. Thanks @pieturin!
I think I know why. The
I think this is just another argument for #1471. |
In valkey-io#1441, we found a assert, and decided remove this assert and instead just free the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed. ``` === VALKEY BUG REPORT START: Cut & paste starting from here === 17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED === 17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true ------ STACK TRACE ------ 17847 valkey-server * src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634] src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de] /__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e] src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea] src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547] /lib64/libc.so.6(+0x40c8) [0x7f083985a0c8] /lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b] src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5] ``` But it also introduces another assert. The reason is that this new node is not added to the cluster nodes dict, we should just free it. ``` 17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED === 17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true ------ STACK TRACE ------ 17128 valkey-server * src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4] src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2] src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618] /__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278] src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09] src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5] src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e] ``` This closes valkey-io#1527. Signed-off-by: Binbin <[email protected]>
) In #1441, we found a assert, and decided remove this assert and instead just free the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed. ``` === VALKEY BUG REPORT START: Cut & paste starting from here === 17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED === 17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true ------ STACK TRACE ------ 17847 valkey-server * src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634] src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de] /__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e] src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea] src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547] /lib64/libc.so.6(+0x40c8) [0x7f083985a0c8] /lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b] src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5] ``` But it also introduces another assert. The reason is that this new node is not added to the cluster nodes dict. ``` 17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED === 17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true ------ STACK TRACE ------ 17128 valkey-server * src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4] src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2] src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618] /__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278] src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09] src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5] src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e] ``` This closes #1527. Signed-off-by: Binbin <[email protected]>
Update comments and log message in `cluster_legacy.c`. Follow-up from #1441. Signed-off-by: Pierre Turin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Binbin <[email protected]>
…lkey-io#1535) In valkey-io#1441, we found a assert, and decided remove this assert and instead just free the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed. ``` === VALKEY BUG REPORT START: Cut & paste starting from here === 17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED === 17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true ------ STACK TRACE ------ 17847 valkey-server * src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634] src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de] /__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e] src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea] src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547] /lib64/libc.so.6(+0x40c8) [0x7f083985a0c8] /lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b] src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5] ``` But it also introduces another assert. The reason is that this new node is not added to the cluster nodes dict. ``` 17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED === 17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true ------ STACK TRACE ------ 17128 valkey-server * src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4] src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2] src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618] /__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278] src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09] src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5] src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e] ``` This closes valkey-io#1527. Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>
Update comments and log message in `cluster_legacy.c`. Follow-up from valkey-io#1441. Signed-off-by: Pierre Turin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>
…y-io#1441) Add `meet_sent` field in `clusterNode` indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period when detecting a node without an inbound link. When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets. Improve some logging messages to include `human_nodename`. Add `nodeExceedsHandshakeTimeout()` function. This is a follow-up to this previous PR: valkey-io#1307 And a partial fix to the crash described in: valkey-io#1436 --------- Signed-off-by: Pierre Turin <[email protected]>
…lkey-io#1535) In valkey-io#1441, we found a assert, and decided remove this assert and instead just free the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed. ``` === VALKEY BUG REPORT START: Cut & paste starting from here === 17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED === 17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true ------ STACK TRACE ------ 17847 valkey-server * src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634] src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de] /__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e] src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea] src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547] /lib64/libc.so.6(+0x40c8) [0x7f083985a0c8] /lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b] src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5] ``` But it also introduces another assert. The reason is that this new node is not added to the cluster nodes dict. ``` 17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED === 17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true ------ STACK TRACE ------ 17128 valkey-server * src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4] src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2] src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618] /__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278] src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09] src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5] src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e] ``` This closes valkey-io#1527. Signed-off-by: Binbin <[email protected]>
Update comments and log message in `cluster_legacy.c`. Follow-up from valkey-io#1441. Signed-off-by: Pierre Turin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Binbin <[email protected]>
Update comments and log message in `cluster_legacy.c`. Follow-up from valkey-io#1441. Signed-off-by: Pierre Turin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
Update comments and log message in `cluster_legacy.c`. Follow-up from valkey-io#1441. Signed-off-by: Pierre Turin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Binbin <[email protected]>
Add
meet_sent
field inclusterNode
indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period when detecting a node without an inbound link.When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets.
Improve some logging messages to include
human_nodename
. AddnodeExceedsHandshakeTimeout()
function.This is a follow-up to this previous PR: #1307
And a partial fix to the crash described in: #1436