-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scoped RDB loading context and immediate abort flag #1173
base: unstable
Are you sure you want to change the base?
Conversation
acedb47
to
4aee158
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #1173 +/- ##
============================================
+ Coverage 70.66% 70.70% +0.04%
============================================
Files 114 115 +1
Lines 63150 63164 +14
============================================
+ Hits 44626 44662 +36
+ Misses 18524 18502 -22
|
General comment: Basically we would like to have a method to tell the current load process to stop ASAP. This can also be achieved by adding an RIO flag (eg #define RIO_FLAG_STOP_ASAP (1 << 2)) and have rio check for this flag when it is performing different io operations. the only issue is that the rdb RIO is local to the rdbLoadRioWithCtx. we can, however keep a pointer in the server to the current active loading rio so that in any point during load we can set the flag RIO_FLAG_STOP_ASAP on the current loading rio. IMO this would be cleaner. |
src/server.h
Outdated
@@ -2038,6 +2038,7 @@ struct valkeyServer { | |||
long long reploff; | |||
long long read_reploff; | |||
int dbid; | |||
uint64_t close_asap : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we piggyback on the existing state variables to detect when the sync has been aborted / primary connection dropped? Since cancelReplicationHandshake
is called when the connection is dropped and updates:
server.repl_rdb_channel_state = REPL_DUAL_CHANNEL_STATE_NONE;
Can't we simply use server.repl_rdb_channel_state
in rdbLoadProgressCallback
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably won't work and will create an issue when RDB dual channel isn't used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
server.repl_rdb_channel_state
will be equal to REPL_DUAL_CHANNEL_STATE_NONE
also when dual channel is disabled
src/rdb.c
Outdated
if (server.repl_provisional_primary.close_asap == 1) { | ||
serverLog(LL_WARNING, "Primary main connection dropped during RDB load callback"); | ||
return -1; | ||
} | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm not following why we can't null out the connections here and use that instead of a new close_asap flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would still require to add logic to the rioConnRead/Write right? We could flag the rdb with RIO_FLAG_READ_ERROR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Null out the fields will look the same as running sync with dual channel disabled.
It would still require to add logic to the rioConnRead/Write right?
right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could flag the rdb with RIO_FLAG_READ_ERROR
While using RIO flags could potentially offer a cleaner solution, it does present its own challenges. As you mentioned, we would need to maintain a pointer in the server to the current active loading RIO. This would require adding and maintaining a currently_loading_rdb
field in the server struct.
This approach, while potentially more flexible for future use cases, introduces additional complexity and state management. It would require changes in multiple parts of the RDB load process to properly set, use, and clear this pointer.
The current proposed solution, while more specific to this use case, has the advantage of being more localized and doesn't require global state management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose a third option is to remove the event handler for the replication connection before calling process events and then reinstalling it after the fact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@madolson I am not sure I understand this proposal. We need to process the events while we are loading in order to keep feeding the local replication buffer AFAIK. We could (as a third option) do nothing when we identify the replication link was broken and complete the load (or let it disconnect as well), however I do feel that having the ability to bail out from a load is something we might find handy in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This idea made sense to me when I posted it but reading back it doesn't make sense, I might have just been missing something. More generally I want to move away from doing recursive calls for handling processing events, and it that world we can just skip it, but that is likely a much larger change than what we want to do here.
a7aac51
to
6f9d737
Compare
…onnection handling Introduces a dedicated flag in provisional primary struct to signal immediate abort, preventing potential use-after-free scenarios during replication disconnection in dual-channel load. This ensures proper termination of rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss on main connection. Fixes valkey-io#1152 Signed-off-by: naglera <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
- Add test to consistently reproduce rdb load callback crash - Avoid checking close_asap when no data was processed Signed-off-by: naglera <[email protected]>
…ion disconnection handling" This reverts commit b873d41. Signed-off-by: naglera <[email protected]>
This commit introduces a new mechanism for temporarily changing the server's loading_rio context during RDB loading operations. The new RDB_SCOPED_LOADING_RIO macro allows for a scoped change of the server.loading_rio value, ensuring that it's automatically restored to its original value when the scope ends. Signed-off-by: naglera <[email protected]>
d5e83f6
to
9849350
Compare
Signed-off-by: naglera <[email protected]>
6cc4f5e
to
eba00eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @naglera looks promising! I like scoped actions, but I only want to make sure about the compiler support is not compromised.
BTW if it is not, we can consider having a generic ScopeGuard macro in Valkey
src/replication.c
Outdated
@@ -2833,6 +2833,7 @@ int readIntoReplDataBlock(connection *conn, replDataBufBlock *data_block, size_t | |||
} | |||
if (nread == 0) { | |||
serverLog(LL_VERBOSE, "Provisional primary closed connection"); | |||
if (server.loading_rio) server.loading_rio->flags |= RIO_FLAG_CLOSE_ASAP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe add a short comment explaining why we are marking it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also see my comment later about encapsulate the flagging in a dedicated function
@@ -168,8 +169,13 @@ static inline int rioGetWriteError(rio *r) { | |||
return (r->flags & RIO_FLAG_WRITE_ERROR) != 0; | |||
} | |||
|
|||
/* Like rioGetReadError() but for async close errors. */ | |||
static inline int rioGetAsyncCloseError(rio *r) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not used anywhere. suggest to drop it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rioGetWriteError
is not used anywhere either. This would maintain symmetry in our error handling capabilities for read, write, and async close operations, even if they're not currently utilized.
src/replication.c
Outdated
@@ -2833,6 +2833,7 @@ int readIntoReplDataBlock(connection *conn, replDataBufBlock *data_block, size_t | |||
} | |||
if (nread == 0) { | |||
serverLog(LL_VERBOSE, "Provisional primary closed connection"); | |||
if (server.loading_rio) server.loading_rio->flags |= RIO_FLAG_CLOSE_ASAP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe for the sake of encapsulation we can have a function to encapsulate the flagging logic? like rioCloseASAP(rio *rdb)
src/rdb.h
Outdated
|
||
/* Macro to temporarily set server.loading_rio within a scope. */ | ||
#define RDB_SCOPED_LOADING_RIO(new_rio) \ | ||
__attribute__((cleanup(_restore_loading_rio))) rio *_old_rio __attribute__((unused)) = server.loading_rio; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice IMO. only thing is that I do not recall our compiler support scope in Valkey (@zuiderkwast DYK?)... for example is MSVC supported? I guess if we do we can just place a cleanup logic on the only 2 places where we return from the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You raise a valid point about compiler support. Regarding the return points, currently rdbLoadWithLoadingCtx
has about 5 return points, and this number may increase in the future as the function evolves. If we don't enforce a single return path, and not use scope based variable, we risk introducing bugs in the future where the cleanup logic might be missed on new return paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other way to handle this is just to split the function, so we move the rest of the logic into another function that we call and can make sure it's correctly restored on return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My ask would be that we split the discussion, let's do it manually for now and then open a separate issue about supporting this? I think generally we should have a general macro for this as well, not just one specific for the server.loading_rio.
wait_for_condition 500 1000 { | ||
[string match "*slave*,state=wait_bgsave*,type=rdb-channel*" [$primary info replication]] && | ||
[string match "*slave*,state=bg_transfer*,type=main-channel*" [$primary info replication]] && | ||
[s -1 rdb_bgsave_in_progress] eq 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should wait to see that the sync was successful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use rdb-key-save-delay
in this test, which intentionally slows down the RDB saving process. Due to potential context switches, the sync time can be unpredictable and might take longer than expected. This unpredictability could lead to test flakyness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could reduce it to 0 and wait
Co-authored-by: ranshid <[email protected]> Signed-off-by: Amit Nagler <[email protected]>
Signed-off-by: naglera <[email protected]>
src/replication.c
Outdated
@@ -2833,6 +2833,8 @@ int readIntoReplDataBlock(connection *conn, replDataBufBlock *data_block, size_t | |||
} | |||
if (nread == 0) { | |||
serverLog(LL_VERBOSE, "Provisional primary closed connection"); | |||
/* Signal ongoing RDB load to terminate gracefully */ | |||
if (server.loading_rio) rioCloseASAP(server.loading_rio); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be invoked on line 2832 as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, incase of connection state changes
Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve in order to indicate this generally looks fine to me. We do need to decide on the cleanup attribute use which I think is mostly supported with some exceptions. (At least we would probably get compilation error IMO)
Signed-off-by: Madelyn Olson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good to me.
|
||
$primary config set repl-diskless-sync yes | ||
$primary config set dual-channel-replication-enabled yes | ||
$primary config set loglevel debug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be set to debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It helped while writing the test. We don't need it anymore
set primary [srv 0 client] | ||
set primary_host [srv 0 host] | ||
set primary_port [srv 0 port] | ||
set loglines [count_log_lines 0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You set this value later, so it has no impact here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use it between the two sets for
wait_for_log_messages 0 {"*Loading RDB produced by Valkey version*"} $loglines 1000 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you set it again on line 1228. You actually set it three times.
Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
…plica while syncing (only expect it to be eventually connected) Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
This PR introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
RDB_SCOPED_LOADING_RIO
macro allows for a scoped change of theserver.loading_rio
value, ensuring that it's automatically restoredto its original value when the scope ends.
Introduces a dedicated flag to
rio
to signal immediate abort, preventingpotential use-after-free scenarios during replication disconnection in
dual-channel load. This ensures proper termination of
rdbLoadRioWithLoadingCtx
when replication is cancelled due to connection loss on main connection.
Fixes #1152