-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove KV WAL #94
base: master
Are you sure you want to change the base?
remove KV WAL #94
Conversation
Signed-off-by: Jay Lee <[email protected]>
Signed-off-by: Jay Lee <[email protected]>
|
||
### Relation between sequence number and (region ID, apply index) | ||
|
||
In RocksDB, WAL plays the role of recording both data and order. MANIFEST plays the role of recording persistent data. Now that WAL is removed, we need a new component to replace its role. Raft logs already have all the necessary data and partial order within the same peer, and the order across peers is traced by region version. So all we need to do is record the exact index KV DB belongs to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should elaborate on the issue why need to record this. One can simply think just replaying from the previous apply index to the commit index is enough, without realizing the apply index is stored in RAFT CF of kvdb and different CFs are flushed separately.
|
||
To persist the relations, we need to send them back to the raftstore thread, along with `ApplyRes`. Noticing SN is strictly monotonically increased just like log index, we can store the relations just like raft entries. | ||
|
||
Because we need to use the relation to recover writes, so before a rocksdb flush finishes, we need to ensure the corresponding relation is persisted. When using separate RocksDB for each region, this is simple as we only care about one region at one flush. So just waiting for the SN to persist is enough. When sharing RocksDB for all regions, we need to ensure SN is persisted for all affected regions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to consider sharing RocksDB for regions? I think we can just deliver it after using separate RocksDB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new architecture will not be stable for quite some time.
|
||
For all other relations, they can be merged together and only keep the largest SN and largest apply index for each region ID. | ||
|
||
### How to recover from relation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let call it replay
instead of recover
|
||
After TiKV is restarted, it should check the maximum SN number for each CF, and choose the minimum SN as the replay start point. Similar to RocksDB, writes can be ignored partially if they contain the data to a CF with larger SN. | ||
|
||
Recovery should always recover the region with the smallest version. If the recovery changes the version, it should pause the current region and choose the new smallest version again. Region with the same version can be recovered with arbitrary order. The consideration is that version is the sync point between regions, operations such as split, merge, to smaller regions must happen earlier when their range overlap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recovery should always recover the region with the smallest version. If the recovery changes the version, it should pause the current region and choose the new smallest version again. Region with the same version can be recovered with arbitrary order. The consideration is that version is the sync point between regions, operations such as split, merge, to smaller regions must happen earlier when their range overlap. | |
Recovery should always recover the region with the smallest version. If the recovery is about to change the version, it should pause the current region and choose the new smallest version again. Region with the same version can be recovered with arbitrary order. The consideration is that version is the sync point between regions, operations such as split, merge, to smaller regions must happen earlier when their range overlap. |
|
||
KV RocksDB write rates can be more than 10kops, saving all relations can be a huge cost. The point of relation is to record the order of writes and get a suitable replay start point. It doesn’t require all relations to serve the goal. | ||
|
||
To get a suitable replay start point, only the maximum flushed SN needs to be considered. And only frozen memtables will be flushed, so every time a memtable is frozen, we can hint raftstore to the maximum SN this memtables contains. And before flushing the memtable, we wait for the SN relation to be persisted. There are existing hooks, EventListener, allowing us to make the modifications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And before flushing the memtable, we wait for the SN relation to be persisted.
It's not atomic, what if it restarts after the SN relation is persisted but the memtable hasn't been finished flushing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it will replay from a smaller apply index.
|
||
To build relations between SN and apply index, we need to know what SN is assigned to each write. RocksDB doesn’t have a public API to provide the information. Fortunately, exposing it is just a work of several lines. | ||
|
||
To persist the relations, we need to send them back to the raftstore thread, along with `ApplyRes`. Noticing SN is strictly monotonically increased just like log index, we can store the relations just like raft entries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the relation persisted, I mean, the detailed format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's an implementation details. It can be changed to best suit different engine types.
Rendered version