Split dual-channel COB overrun tests to separate servers #1374

naglera · 2024-11-28T14:16:38Z

The test isn't waiting long enough for the output buffer to overrun. This problem is happening because an error from the previous test is bleeding into the current test's logs. The simplest fix would be to split these tests.
Increased replication timeout to ensure sync fails due to output buffer overrun before a timeout occurs.

1. The test isn't waiting long enough for the output buffer to overrun. This problem is happening because an error from the previous test is bleeding into the current test's logs. The simplest fix would be to split these tests. 2. Increased replication timeout to ensure sync fails due to output buffer overrun before a timeout occurs. Solve valkey-io#1367 Signed-off-by: naglera <[email protected]>

naglera · 2024-11-28T14:16:57Z

Notice from the master test logs, the client which reached COB limit is client 11 which was created in Test dual-channel-replication primary gets cob overrun before established psync.
Master logs

### Starting test Test dual-channel-replication primary gets cob overrun during replica rdb load in tests/integration/dual-channel-replication.tcl
30734:M 27 Nov 2024 00:17:49.660 - Accepted 127.0.0.1:58710
30734:M 27 Nov 2024 00:17:49.752 * Replica 127.0.0.1:26735 asks for synchronization
30734:M 27 Nov 2024 00:17:49.752 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'e04efde3cfe1e016032376da2dd5052c3266ed73', my replication IDs are '527875139eea988abbbb8d77632f6a8697230a40' and '0000000000000000000000000000000000000000')
30734:M 27 Nov 2024 00:17:49.752 * <Dual Channel> Replica 127.0.0.1:26735 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
30734:M 27 Nov 2024 00:17:49.761 # Client id=11 addr= laddr=127.0.0.1:26734 fd=14 name= age=0 idle=0 flags=SA db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1032 rbp=0 obl=0 oll=67 omem=1102016 tot-mem=1103952 events= cmd=sync user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=121 tot-net-out=5 tot-cmds=2 scheduled to be closed ASAP for overcoming of output buffer limits.
30734:M 27 Nov 2024 00:17:49.768 * <Dual Channel> Replica client id #11 rdb channel disconnected.
30734:M 27 Nov 2024 00:17:49.768 . <Dual Channel> Remove psync waiting replica client id #11 with cid 11 from replicas rax.
30734:M 27 Nov 2024 00:17:53.944 - DB 9: 244872 keys (0 volatile) in 262144 slots HT.
30734:M 27 Nov 2024 00:17:53.944 . Total: 5 clients connected (0 replicas), 124288424 (118.53M) bytes in use

codecov · 2024-11-28T14:31:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.62%. Comparing base (a939cb8) to head (f3bd992).
Report is 5 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1374      +/-   ##
============================================
+ Coverage     70.59%   70.62%   +0.02%     
============================================
  Files           117      117              
  Lines         63324    63324              
============================================
+ Hits          44704    44722      +18     
+ Misses        18620    18602      -18

see 10 files with indirect coverage changes

xbasel · 2024-11-28T14:56:08Z

tests/integration/dual-channel-replication.tcl

+    }
+}
+
+start_server {tags {"dual-channel-replication external:skip"}} {


Can restart_instance be used instead?

restart_instance could be used here, but splitting tests to different servers simplifies debugging by separating logs and avoids unnecessary test coupling. These tests don't require shared server state.

ranshid · 2024-11-28T17:21:13Z

is splitting the test really the only option? we can fetch the replica client ID and use it to search the log. I generally prefer we avoid spawning many servers if no real need

zuiderkwast

Good job! This test case has been haunting us for a while.

To start many servers take some time, so it's good to avoid it to keep total test time shorter. But in this case maybe it's not a problem. I also agree that separating them is better for isolation of tests where we check for log messages.

@ranshid @xbasel Do you want to try to find other alternatives?

enjoy-binbin · 2024-12-01T13:35:50Z

To start many servers take some time, so it's good to avoid it to keep total test time shorter. But in this case maybe it's not a problem. I also agree that separating them is better for isolation of tests where we check for log messages.

I also suffered this when dealing with #1297, i think split it is a good idea though it will increase the test time.

naglera requested a review from ranshid November 28, 2024 14:18

ranshid requested review from xbasel and removed request for ranshid November 28, 2024 14:19

xbasel reviewed Nov 28, 2024

View reviewed changes

xbasel approved these changes Nov 28, 2024

View reviewed changes

naglera changed the title ~~Split dual-channel output buffer overrun tests to separate servers~~ Split dual-channel COB overrun tests to separate servers Nov 28, 2024

zuiderkwast approved these changes Nov 28, 2024

View reviewed changes

enjoy-binbin approved these changes Nov 30, 2024

View reviewed changes

enjoy-binbin merged commit 7043ef0 into valkey-io:unstable Dec 1, 2024
47 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split dual-channel COB overrun tests to separate servers #1374

Split dual-channel COB overrun tests to separate servers #1374

naglera commented Nov 28, 2024

naglera commented Nov 28, 2024

codecov bot commented Nov 28, 2024 •

edited

Loading

xbasel Nov 28, 2024

naglera Nov 28, 2024

ranshid commented Nov 28, 2024

zuiderkwast left a comment

enjoy-binbin commented Dec 1, 2024

Split dual-channel COB overrun tests to separate servers #1374

Split dual-channel COB overrun tests to separate servers #1374

Conversation

naglera commented Nov 28, 2024

naglera commented Nov 28, 2024

codecov bot commented Nov 28, 2024 • edited Loading

Codecov Report

xbasel Nov 28, 2024

Choose a reason for hiding this comment

naglera Nov 28, 2024

Choose a reason for hiding this comment

ranshid commented Nov 28, 2024

zuiderkwast left a comment

Choose a reason for hiding this comment

enjoy-binbin commented Dec 1, 2024

codecov bot commented Nov 28, 2024 •

edited

Loading