Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split dual-channel COB overrun tests to separate servers #1374

Merged
merged 1 commit into from
Dec 1, 2024

Conversation

naglera
Copy link
Contributor

@naglera naglera commented Nov 28, 2024

  1. The test isn't waiting long enough for the output buffer to overrun. This problem is happening because an error from the previous test is bleeding into the current test's logs. The simplest fix would be to split these tests.
  2. Increased replication timeout to ensure sync fails due to output buffer overrun before a timeout occurs.

Fixes #1367

1. The test isn't waiting long enough for the output buffer to overrun. This problem is happening because an error from the previous test is bleeding into the current test's logs. The simplest fix would be to split these tests.
2. Increased replication timeout to ensure sync fails due to output buffer overrun before a timeout occurs.

Solve valkey-io#1367

Signed-off-by: naglera <[email protected]>
@naglera
Copy link
Contributor Author

naglera commented Nov 28, 2024

Notice from the master test logs, the client which reached COB limit is client 11 which was created in Test dual-channel-replication primary gets cob overrun before established psync.
Master logs

### Starting test Test dual-channel-replication primary gets cob overrun during replica rdb load in tests/integration/dual-channel-replication.tcl
30734:M 27 Nov 2024 00:17:49.660 - Accepted 127.0.0.1:58710
30734:M 27 Nov 2024 00:17:49.752 * Replica 127.0.0.1:26735 asks for synchronization
30734:M 27 Nov 2024 00:17:49.752 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'e04efde3cfe1e016032376da2dd5052c3266ed73', my replication IDs are '527875139eea988abbbb8d77632f6a8697230a40' and '0000000000000000000000000000000000000000')
30734:M 27 Nov 2024 00:17:49.752 * <Dual Channel> Replica 127.0.0.1:26735 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
30734:M 27 Nov 2024 00:17:49.761 # Client id=11 addr= laddr=127.0.0.1:26734 fd=14 name= age=0 idle=0 flags=SA db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1032 rbp=0 obl=0 oll=67 omem=1102016 tot-mem=1103952 events= cmd=sync user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=121 tot-net-out=5 tot-cmds=2 scheduled to be closed ASAP for overcoming of output buffer limits.
30734:M 27 Nov 2024 00:17:49.768 * <Dual Channel> Replica client id #11 rdb channel disconnected.
30734:M 27 Nov 2024 00:17:49.768 . <Dual Channel> Remove psync waiting replica client id #11 with cid 11 from replicas rax.
30734:M 27 Nov 2024 00:17:53.944 - DB 9: 244872 keys (0 volatile) in 262144 slots HT.
30734:M 27 Nov 2024 00:17:53.944 . Total: 5 clients connected (0 replicas), 124288424 (118.53M) bytes in use

@naglera naglera requested a review from ranshid November 28, 2024 14:18
@ranshid ranshid requested review from xbasel and removed request for ranshid November 28, 2024 14:19
Copy link

codecov bot commented Nov 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.62%. Comparing base (a939cb8) to head (f3bd992).
Report is 5 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1374      +/-   ##
============================================
+ Coverage     70.59%   70.62%   +0.02%     
============================================
  Files           117      117              
  Lines         63324    63324              
============================================
+ Hits          44704    44722      +18     
+ Misses        18620    18602      -18     

see 10 files with indirect coverage changes

}
}

start_server {tags {"dual-channel-replication external:skip"}} {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can restart_instance be used instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restart_instance could be used here, but splitting tests to different servers simplifies debugging by separating logs and avoids unnecessary test coupling. These tests don't require shared server state.

@naglera naglera changed the title Split dual-channel output buffer overrun tests to separate servers Split dual-channel COB overrun tests to separate servers Nov 28, 2024
@ranshid
Copy link
Member

ranshid commented Nov 28, 2024

is splitting the test really the only option? we can fetch the replica client ID and use it to search the log. I generally prefer we avoid spawning many servers if no real need

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! This test case has been haunting us for a while.

To start many servers take some time, so it's good to avoid it to keep total test time shorter. But in this case maybe it's not a problem. I also agree that separating them is better for isolation of tests where we check for log messages.

@ranshid @xbasel Do you want to try to find other alternatives?

@enjoy-binbin enjoy-binbin merged commit 7043ef0 into valkey-io:unstable Dec 1, 2024
47 checks passed
@enjoy-binbin
Copy link
Member

To start many servers take some time, so it's good to avoid it to keep total test time shorter. But in this case maybe it's not a problem. I also agree that separating them is better for isolation of tests where we check for log messages.

I also suffered this when dealing with #1297, i think split it is a good idea though it will increase the test time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[TEST FAILURE] InTest dual-channel-replication primary gets cob overrun during replica rdb load
5 participants