PerconaFT Engine Architecture
The mongo storage engine API has a FAQ here, but it hasn't been kept up to date much (the last update was 4 months ago as of this writing).
Roughly, the API has these components that need to be implemented:

- `RecordStore`: responsible for storing BSON documents accessible by `RecordId`. May be capped.
- `SortedDataInterface`: implements an index (including the `_id` index), which maps index keys to `RecordId`s. May be unique.
- `RecoveryUnit`: manages a transaction for the storage engine, essentially. Lives inside an `OperationContext` which is passed around through most functions in the code. Arguably the most complicated part of the API because it has strange rules around atomicity and read-your-own-writes that don't always map directly to typical transactional semantics. However, we've got this mostly figured out; it shouldn't need much tweaking.
- `StorageEngine`: just provides the initialization and glue that lets the server create the above three objects.
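To make the division of labor concrete, here is a minimal sketch of the `RecordStore` shape described above. `ToyRecordStore`, `insertRecord`, and `dataFor` are hypothetical names for illustration only; the real mongo interface also threads an `OperationContext` through every call and returns `Status`/`StatusWith` types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, heavily simplified shape of a RecordStore: it maps
// RecordIds to opaque record bytes. Not the actual mongo signatures.
using RecordId = int64_t;
using RecordData = std::string;  // opaque bytes (a serialized BSON object)

class ToyRecordStore {
public:
    // RecordIds are usually just autoincrement integers.
    RecordId insertRecord(const RecordData& doc) {
        RecordId id = _nextId++;
        _records[id] = doc;
        return id;
    }
    // Look up the stored bytes for a RecordId; nullptr if absent.
    const RecordData* dataFor(RecordId id) const {
        auto it = _records.find(id);
        return it == _records.end() ? nullptr : &it->second;
    }
private:
    RecordId _nextId = 1;
    std::map<RecordId, RecordData> _records;
};
```

A `SortedDataInterface` would then map index keys to those `RecordId`s, and a `RecoveryUnit` would wrap all of these operations in a transaction.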
You'll also be exposed to the following types of objects:

- `RecordId`: usually just an autoincrement integer (except in the case of the oplog, where you have to extract it from the optime here) that you get to generate inside the `RecordStore`.
- [`KeyString`](https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/key_string.h): the serialized form of either a `RecordId` or a BSON key, optionally concatenated with a `RecordId`. Once encoded, these can be compared with `memcmp`, which is a nice feature. They can be decoded back into `RecordId`s and BSON keys, but to decode an index key you also need `TypeBits`, which are usually stored next to the encoded `KeyString`.
- `RecordData`: just a chunk of bytes representing a record as stored on disk. It's a serialized BSON object, but you don't need to know that; just store it and hand it back.
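The `memcmp`-comparable property is the key idea behind `KeyString`. A toy way to see it, under a big assumption: real `KeyString` encodes full BSON (with `TypeBits` for lossless decoding), but encoding a plain unsigned integer big-endian already has the same property, which this hypothetical sketch illustrates:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Toy illustration of the KeyString property: encode keys so that memcmp
// on the encoded bytes agrees with the logical key ordering. Here we just
// encode a uint64 big-endian (most significant byte first); real KeyString
// does this for arbitrary BSON keys.
std::string encodeKey(uint64_t k) {
    std::string out(8, '\0');
    for (int i = 7; i >= 0; --i) {
        out[i] = static_cast<char>(k & 0xff);
        k >>= 8;
    }
    return out;
}

// Byte-wise comparison; matches numeric order thanks to the encoding.
int compareEncoded(const std::string& a, const std::string& b) {
    return std::memcmp(a.data(), b.data(), 8);
}
```

Note that naive little-endian storage would not work here: 200 would compare below 3 because the low-order byte comes first.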
We decided that having separate `RecordStore` and `SortedDataInterface` classes was a bit complicated, and that a better API would match the fractal tree API more closely, as just a sorted key/value dictionary. Ideally this API would also be suitable for RocksDB and WiredTiger, but it seems unlikely that they'll use this API at this point. In any case, the code in the KVDictionary layer implements most of the complicated operations in the `RecordStore` and `SortedDataInterface` (including logic around capped collections and tailable cursors, as well as persisting size and count statistics for engines which can't provide those simply themselves) in terms of a simpler API called `KVDictionary`.
We don't simplify the `RecoveryUnit` abstraction in this layer yet, but it wouldn't be a bad idea to try.
The `KVDictionary` layer requires the implementation of the following classes:

- `KVDictionary`: this is the main sorted map data structure. You need to be able to do inserts, gets, and deletes, and provide a `Cursor` which can seek to a key and go next in a given direction. Optionally, you can implement optimized versions of updates and duplicate key checks for unique indexes.
- `KVEngineImpl`: just needs to create new `KVDictionary`s and `RecoveryUnit`s. You still need to implement `RecoveryUnit` yourself.
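The cursor is the part with the most semantics packed into it: seek to a key, then step in a fixed direction. Here is a hedged sketch of that behavior over a `std::map`; `ToyCursor`, `seek`, `ok`, and `advance` are illustrative names, not the actual `KVDictionary::Cursor` interface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy cursor with the shape described above: seek to a key, then step in
// one direction. A sketch over std::map, not the real KVDictionary code.
class ToyCursor {
public:
    using Map = std::map<std::string, std::string>;
    ToyCursor(const Map& m, bool forward)
        : _map(m), _forward(forward), _it(m.end()) {}

    // Position at the first entry >= key (forward) or <= key (backward).
    void seek(const std::string& key) {
        _it = _map.lower_bound(key);
        if (!_forward && (_it == _map.end() || _it->first != key)) {
            if (_it == _map.begin()) { _it = _map.end(); return; }  // nothing <= key
            --_it;
        }
    }
    bool ok() const { return _it != _map.end(); }
    const std::string& key() const { return _it->first; }
    void advance() {
        if (_forward) { ++_it; }
        else if (_it == _map.begin()) { _it = _map.end(); }  // exhausted
        else { --_it; }
    }
private:
    const Map& _map;
    bool _forward;
    Map::const_iterator _it;
};
```

The backward case is where off-by-one bugs tend to live: seeking backward to a key that isn't present has to land on the next *smaller* entry, not the next larger one.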
The `KVDictionary` layer makes use of a class `Slice`, which is just a contiguous range of bytes with some logic around memory management and borrowing. These are used for keys and values. It's pretty straightforward, but be careful about returning owned `Slice`s.
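To illustrate the borrowing concern, here is a toy `Slice` in the spirit described above. The names and layout are guesses for illustration, not the actual class; the point is the distinction between a borrowed view (which must not outlive its buffer) and an owned copy (which is safe to return).

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Toy Slice: a pointer+size view over bytes that can either borrow the
// memory or own a copy of it. Illustrative only, not the real class.
class Slice {
public:
    // Borrow: the caller must keep the underlying bytes alive.
    Slice(const char* data, size_t size) : _data(data), _size(size) {}
    // Take ownership by copying; safe to return from a function.
    static Slice owned(const char* data, size_t size) {
        Slice s(nullptr, size);
        s._storage = std::make_shared<std::string>(data, size);
        s._data = s._storage->data();
        return s;
    }
    const char* data() const { return _data; }
    size_t size() const { return _size; }
    bool isOwned() const { return _storage != nullptr; }
private:
    const char* _data;
    size_t _size;
    std::shared_ptr<std::string> _storage;  // null when borrowed
};
```

Returning a borrowed slice into a buffer that's about to be freed (e.g. an engine's internal cursor buffer) is exactly the kind of bug the "be careful about returning owned `Slice`s" warning is about.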
A non-persistent reference implementation of the `KVDictionary` API using a `std::map` is available, called `kv_heap`. It's a good place to start to see what it takes to make a basic engine; there's a bit of glue code in kv_heap_init.cpp and in the SConscript that's worth checking, and you should see how the unit tests work.
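The core of such a reference engine is small. This sketch, in the spirit of `kv_heap` but not the actual code, shows roughly what the non-cursor half of a `KVDictionary` implementation amounts to when backed by `std::map` (a real engine would also involve the `RecoveryUnit` in every operation):

```cpp
#include <cassert>
#include <map>
#include <string>

// A non-persistent KVDictionary-style store backed by std::map, in the
// spirit of kv_heap. Illustrative only; not the actual kv_heap code.
class HeapDictionary {
public:
    void insert(const std::string& key, const std::string& value) {
        _map[key] = value;  // insert-or-overwrite, like a dictionary put
    }
    bool get(const std::string& key, std::string* out) const {
        auto it = _map.find(key);
        if (it == _map.end()) return false;
        *out = it->second;
        return true;
    }
    void remove(const std::string& key) { _map.erase(key); }
    size_t size() const { return _map.size(); }
private:
    std::map<std::string, std::string> _map;
};
```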
The PerconaFT engine lives in tokuft and implements the `KVDictionary` API, including its optional parts (updates, capped delete optimization, duplicate key checking), and makes use of some optional parts of `KVDictionary` (`KVSizeStorer` to maintain persistent stats, and `VisibleIdTracker` to track document visibility in capped collections).
The `RecoveryUnit` is a bit weird. The way we've done it is to not use child transactions, and to use a counter to track nesting depth. We commit when the outermost commit happens, and abort if the outermost transaction ends without being committed. When we create a transaction, we peek at the lock state to see whether the client intends to write something; if they do, we create a `DB_SERIALIZABLE` transaction so that we take read locks; if not, we create a `DB_TXN_SNAPSHOT | DB_TXN_READ_ONLY` transaction. We also have a way to detect when we're a secondary, and in that case we pretend we're prelocked because we know there will be no conflicts (https://github.com/percona/percona-server-mongodb/blob/b84f79c259a7db416e1b01f673c20b7e35a2ca7d/src/mongo/db/storage/tokuft/tokuft_dictionary.cpp#L82-L100).
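The counter scheme can be sketched as follows. This is a toy model of the nesting rule described above, with made-up names; the real code issues actual ydb transaction calls (with the `DB_SERIALIZABLE` vs. `DB_TXN_SNAPSHOT | DB_TXN_READ_ONLY` choice made from the lock state) where the comments stand in for them.

```cpp
#include <cassert>

// Toy model of the no-child-transactions scheme: a depth counter, commit
// only when the outermost level commits, abort if the outermost level
// ends without a commit. Placeholder comments stand in for ydb calls.
class ToyRecoveryUnit {
public:
    void beginUnitOfWork() {
        if (_depth++ == 0) {
            // txnBegin(flags chosen by peeking at the lock state)
            _active = true;
        }
    }
    void commitUnitOfWork() {
        if (_depth == 1) _committed = true;  // only the outermost commit counts
    }
    void endUnitOfWork() {
        if (--_depth == 0) {
            if (_committed) { /* txnCommit() */ }
            else            { /* txnAbort()  */ _aborted = true; }
            _active = false;
            _committed = false;
        }
    }
    bool active() const { return _active; }
    bool aborted() const { return _aborted; }
private:
    int _depth = 0;
    bool _active = false;
    bool _committed = false;
    bool _aborted = false;
};
```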
The PerconaFT engine uses a new API layer above the ydb called `ftcxx`, which provides a somewhat more modern, fairly opinionated C++ API for the fractal tree. By far the most complicated thing here is the cursor API, which makes heavy use of templates to be able to inline things like comparison functions for bounds checking, to do zero-copy but inlined callbacks for result handling, and to do automatic buffering for the bulk fetch API. Unfortunately, the structure of the cursors in the mongo storage engine can't make great use of this because they need to copy out the data, but the goal was to have something useful for more than just TokuMXse in the future, and it doesn't slow down TokuMXse. The templates are nasty, but the result is actually quite nice to work with, as you can see in the example in cursor_test.cpp.
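The inlining idea can be shown in miniature. This is not the `ftcxx` cursor API, just a hypothetical sketch of the technique it uses: taking the result-handling callback as a template parameter so the compiler can inline it into the scan loop (rather than calling through a function pointer), while handing the callback references into the data for zero-copy access.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy illustration of the templated-callback technique: the handler is a
// template parameter, so a lambda passed here is inlined into the loop.
// The handler sees references into the dictionary (zero-copy) and can
// stop the scan early by returning false.
template <class Handler>
void scanRange(const std::map<std::string, std::string>& dict,
               const std::string& lowInclusive,
               const std::string& highExclusive,
               Handler&& handle) {
    for (auto it = dict.lower_bound(lowInclusive);
         it != dict.end() && it->first < highExclusive; ++it) {
        if (!handle(it->first, it->second)) break;
    }
}
```

The real `ftcxx` cursor applies the same trick to the comparison functions used for bounds checking and layers automatic buffering on top for bulk fetch, which is where most of the template complexity comes from.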