remove KV WAL #94

BusyJay · 2022-06-02T23:39:41Z

Signed-off-by: Jay Lee <[email protected]>

Connor1996 · 2022-08-01T07:03:11Z

text/0094-remove-kv-wal.md

+
+### Relation between sequence number and (region ID, apply index)
+
+In RocksDB, WAL plays the role of recording both data and order. MANIFEST plays the role of recording persistent data. Now that WAL is removed, we need a new component to replace its role. Raft logs already have all the necessary data and partial order within the same peer, and the order across peers is traced by region version. So all we need to do is record the exact index KV DB belongs to.


This should elaborate on why the index needs to be recorded. One might think that simply replaying from the previous apply index to the commit index is enough, without realizing that the apply index is stored in the RAFT CF of kvdb and that different CFs are flushed separately.
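To make the concern concrete, here is a minimal sketch of what can go wrong when CFs are flushed separately. It is not taken from the proposal: the function name, the `(sn, cf, key)` write shape, and the CF names and numbers are illustrative assumptions. It assumes the largest flushed SN of each CF can be read after restart, as the proposal describes.

```python
def writes_to_replay(flushed_sn_by_cf, writes):
    """Pick the replay start point and the writes that still need applying.

    flushed_sn_by_cf: dict mapping CF name -> largest SN already flushed
        for that CF (hypothetical shape).
    writes: list of (sn, cf, key) tuples in SN order, as they would be
        reconstructed from raft logs (hypothetical shape).
    """
    # Replay must start from the CF that is furthest behind.
    start_sn = min(flushed_sn_by_cf.values())
    # A write is skipped only if its own CF has already flushed past it.
    pending = [
        (sn, cf, key)
        for sn, cf, key in writes
        if sn > flushed_sn_by_cf[cf]
    ]
    return start_sn, pending


# The raft CF (which holds the apply index) was flushed ahead of the others.
flushed = {"default": 120, "write": 90, "raft": 150}
writes = [
    (80, "write", "a"),
    (100, "default", "b"),
    (110, "write", "c"),
    (130, "default", "d"),
    (140, "raft", "e"),
    (160, "raft", "f"),
]
start_sn, pending = writes_to_replay(flushed, writes)
```

In this example, trusting only the state in the raft CF (flushed up to SN 150) would skip the writes at SN 110 and 130, even though the write and default CFs never persisted them.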

Connor1996 · 2022-08-01T07:21:59Z

text/0094-remove-kv-wal.md

+
+To persist the relations, we need to send them back to the raftstore thread, along with `ApplyRes`. Noticing SN is strictly monotonically increased just like log index, we can store the relations just like raft entries.
+
+Because we need the relation to recover writes, we must ensure the corresponding relation is persisted before a RocksDB flush finishes. When using a separate RocksDB for each region, this is simple, as only one region is involved in each flush, so just waiting for the SN to persist is enough. When sharing one RocksDB across all regions, we need to ensure the SN is persisted for all affected regions.


Why do we need to consider sharing RocksDB across regions? I think we can just deliver this after switching to a separate RocksDB per region.

BusyJay · 2022-08-01

The new architecture will not be stable for quite some time.

Connor1996 · 2022-08-01T07:24:54Z

text/0094-remove-kv-wal.md

+
+For all other relations, they can be merged together and only keep the largest SN and largest apply index for each region ID.
+
+### How to recover from relation


Let's call it replay instead of recover.
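The merge rule in the hunk above (keep only the largest SN and the largest apply index per region ID) can be sketched as follows. The `Relation` record and `merge_relations` helper are hypothetical names used for illustration only; the proposal does not define a concrete format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical record tying an SN to a region's apply index."""
    region_id: int
    sn: int
    apply_index: int


def merge_relations(relations):
    """Collapse relations to one entry per region ID.

    For each region, only the largest SN and the largest apply index
    seen so far are kept.
    """
    merged = {}
    for r in relations:
        prev = merged.get(r.region_id)
        if prev is None:
            merged[r.region_id] = r
        else:
            merged[r.region_id] = Relation(
                r.region_id,
                max(prev.sn, r.sn),
                max(prev.apply_index, r.apply_index),
            )
    return merged


relations = [
    Relation(1, sn=10, apply_index=5),
    Relation(2, sn=11, apply_index=7),
    Relation(1, sn=12, apply_index=6),
]
merged = merge_relations(relations)
```

Here the two relations for region 1 collapse into a single entry with SN 12 and apply index 6, while region 2 keeps its only entry.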

Connor1996 · 2022-08-01T07:25:40Z

text/0094-remove-kv-wal.md

+
+After TiKV is restarted, it should check the maximum SN number for each CF, and choose the minimum SN as the replay start point. Similar to RocksDB, writes can be ignored partially if they contain the data to a CF with larger SN.
+
+Recovery should always recover the region with the smallest version. If the recovery changes the version, it should pause the current region and choose the new smallest version again. Region with the same version can be recovered with arbitrary order. The consideration is that version is the sync point between regions, operations such as split, merge, to smaller regions must happen earlier when their range overlap.


Suggested change

-Recovery should always recover the region with the smallest version. If the recovery changes the version, it should pause the current region and choose the new smallest version again. Region with the same version can be recovered with arbitrary order. The consideration is that version is the sync point between regions, operations such as split, merge, to smaller regions must happen earlier when their range overlap.

+Recovery should always recover the region with the smallest version. If the recovery is about to change the version, it should pause the current region and choose the new smallest version again. Region with the same version can be recovered with arbitrary order. The consideration is that version is the sync point between regions, operations such as split, merge, to smaller regions must happen earlier when their range overlap.
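The scheduling rule in this paragraph can be sketched with a min-heap keyed by region version. This is a simplification under stated assumptions: `replay_until_version_change` is a hypothetical callback that replays one region until it either finishes (returns `None`) or reaches a version change (returns the new version), and the region is then re-queued under that new version. Ties between equal versions are broken by region ID here, which is one valid choice since the order among equal versions is arbitrary.

```python
import heapq


def replay_in_version_order(regions, replay_until_version_change):
    """Replay regions smallest-version first.

    regions: dict mapping region_id -> current version.
    replay_until_version_change: hypothetical callback taking a region_id;
        returns the new version if replay paused at a version change,
        or None if the region is fully replayed.
    Returns the list of (region_id, version) in scheduling order.
    """
    heap = [(version, region_id) for region_id, version in regions.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        version, region_id = heapq.heappop(heap)
        order.append((region_id, version))
        new_version = replay_until_version_change(region_id)
        if new_version is not None:
            # Pause this region and pick the smallest version again.
            heapq.heappush(heap, (new_version, region_id))
    return order


# Scripted outcomes: region 2 hits a version change (to 4) on its first run.
script = {1: [None], 2: [4, None], 3: [None]}
order = replay_in_version_order(
    {1: 3, 2: 1, 3: 2},
    lambda region_id: script[region_id].pop(0),
)
```

Region 2 is scheduled first at version 1, pauses when its version becomes 4, and only resumes after regions 3 and 1 have been replayed.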

Connor1996 · 2022-08-01T07:39:35Z

text/0094-remove-kv-wal.md

+
+KV RocksDB write rates can be more than 10k ops, so saving all relations can be a huge cost. The point of relation is to record the order of writes and get a suitable replay start point. It doesn’t require all relations to serve the goal.
+
+To get a suitable replay start point, only the maximum flushed SN needs to be considered. And only frozen memtables will be flushed, so every time a memtable is frozen, we can hint raftstore with the maximum SN this memtable contains. And before flushing the memtable, we wait for the SN relation to be persisted. There are existing hooks, EventListener, allowing us to make the modifications.


"And before flushing the memtable, we wait for the SN relation to be persisted." This is not atomic. What if TiKV restarts after the SN relation is persisted but before the memtable has finished flushing?

BusyJay · 2022-08-01

Then it will replay from a smaller apply index.
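A small sketch of why this case is safe, under the assumption that the replay start is derived from the largest flushed SN rather than from the newest persisted relation. The function name, the `(sn, apply_index)` pair shape, and the fallback of `0` when nothing has been flushed are illustrative assumptions, not part of the proposal.

```python
import bisect


def replay_start_index(relations, flushed_sn):
    """Return the apply index already covered by flushed data.

    relations: list of (sn, apply_index) pairs for one region, sorted by
        sn, all persisted before their memtables were flushed.
    flushed_sn: largest SN contained in flushed (persistent) data.
    Replay resumes after the returned apply index.
    """
    sns = [sn for sn, _ in relations]
    # Find the last relation whose SN is not beyond the flushed SN.
    pos = bisect.bisect_right(sns, flushed_sn)
    if pos == 0:
        return 0  # assumption: nothing flushed yet, replay everything
    return relations[pos - 1][1]


relations = [(100, 10), (200, 25)]
# Restart after relation (200, 25) was persisted but before its memtable
# finished flushing: only SN 100 is durable.
before_flush = replay_start_index(relations, 100)
# Restart after the flush completed.
after_flush = replay_start_index(relations, 200)
```

In the non-atomic case the extra persisted relation is simply ignored, and replay starts from the smaller apply index, redoing the writes that were lost with the memtable.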

Connor1996 · 2022-08-01T07:58:57Z

text/0094-remove-kv-wal.md

+
+To build relations between SN and apply index, we need to know what SN is assigned to each write. RocksDB doesn’t have a public API to provide the information. Fortunately, exposing it takes only a few lines of code.
+
+To persist the relations, we need to send them back to the raftstore thread, along with `ApplyRes`. Noticing SN is strictly monotonically increased just like log index, we can store the relations just like raft entries.


How is the relation persisted? I mean the detailed format.

BusyJay · 2022-08-01

I think it's an implementation detail. It can be changed to best suit different engine types.

BusyJay added 2 commits June 2, 2022 16:39

add remove WAL proposal

c841077

Signed-off-by: Jay Lee <[email protected]>

update number

b469a8e

Signed-off-by: Jay Lee <[email protected]>

5kbpers mentioned this pull request Jun 24, 2022

Remove KV WAL tikv/tikv#12901

Open

BusyJay mentioned this pull request Jul 19, 2022

physical isolation between region #93

Merged

Connor1996 reviewed Aug 1, 2022
