kv (ticdc): fix kvClient reconnection downhill loop (#10559) #10572
This is an automated cherry-pick of #10559
What problem does this PR solve?
Issue Number: close #10239
What is changed and how it works?
- `eventFeedStream` struct: identifies a stream and prevents it from being deleted unexpectedly.
- `eventFeedStream`: prevents the stream from being canceled unexpectedly.
- `s.deleteStream`: limited to being called only once, to prevent the stream from being deleted unexpectedly.
Check List
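The changes listed under "What is changed and how it works?" can be illustrated with a minimal sketch. This is not the code from this PR: the `streamRegistry` type, the `addStream` and `current` helpers, and the fields of `eventFeedStream` are all hypothetical names chosen for illustration. It only shows the general pattern of giving each stream its own identity and cancel function, and guarding deletion so that a stale stream cannot remove or cancel a newer stream registered for the same store address.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// eventFeedStream is a simplified, hypothetical stand-in for the struct named
// in the PR. All fields are assumptions made for this sketch.
type eventFeedStream struct {
	id         uint64             // unique per connection attempt
	addr       string             // store address this stream talks to
	ctx        context.Context    // canceled only by this stream's own cancel
	cancel     context.CancelFunc // cancels this stream and nothing else
	deleteOnce sync.Once          // ensures deletion logic runs at most once
}

// streamRegistry (hypothetical) maps a store address to its current stream.
type streamRegistry struct {
	mu      sync.Mutex
	nextID  uint64
	streams map[string]*eventFeedStream
}

func newStreamRegistry() *streamRegistry {
	return &streamRegistry{streams: make(map[string]*eventFeedStream)}
}

// addStream registers a fresh stream for addr, replacing any previous entry,
// as would happen after a reconnection.
func (r *streamRegistry) addStream(addr string) *eventFeedStream {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	ctx, cancel := context.WithCancel(context.Background())
	s := &eventFeedStream{id: r.nextID, addr: addr, ctx: ctx, cancel: cancel}
	r.streams[addr] = s
	return s
}

// current returns the stream registered for addr, or nil if there is none.
func (r *streamRegistry) current(addr string) *eventFeedStream {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.streams[addr]
}

// deleteStream cancels s and removes it from the registry only if s is still
// the registered stream for its address. It reports whether an entry was
// removed. Repeated calls for the same stream are no-ops.
func (r *streamRegistry) deleteStream(s *eventFeedStream) bool {
	removed := false
	s.deleteOnce.Do(func() {
		s.cancel() // cancel only this stream's own context
		r.mu.Lock()
		defer r.mu.Unlock()
		if cur, ok := r.streams[s.addr]; ok && cur.id == s.id {
			delete(r.streams, s.addr)
			removed = true
		}
	})
	return removed
}

func main() {
	r := newStreamRegistry()
	old := r.addStream("store-1")
	newer := r.addStream("store-1") // reconnection replaces the old stream

	fmt.Println(r.deleteStream(old))           // false: the newer stream is kept
	fmt.Println(r.current("store-1") == newer) // true
	fmt.Println(r.deleteStream(newer))         // true
	fmt.Println(r.deleteStream(newer))         // false: second call is a no-op
}
```

In this sketch, a delayed delete issued on behalf of the old stream leaves the newer stream registered and uncanceled, which is the kind of unexpected deletion the changes above are meant to prevent.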
Tests
A hard-coded `time.Sleep(5 * time.Second)` was inserted before calling `s.deleteStream` in both the unfixed and fixed versions of the code.
unfixed cdc:
[Graph: resolvedTs lag, unfixed CDC]
fixed cdc:
[Graph: resolvedTs lag, fixed CDC]
From the graphs above, the resolvedTs lag in the unfixed CDC can exceed 12 minutes when a TiKV node is restarted, whereas in the fixed CDC the increase in resolvedTs lag is limited to at most 35 seconds. This demonstrates the effectiveness of the fix.
Moreover, when the hard-coded `time.Sleep(5 * time.Second)` is removed and the fixed version of CDC is tested again, the lag becomes even smaller:
[Graph: resolvedTs lag, fixed CDC without the hard-coded sleep]
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note