Welcome to the Op-Engine repository, a custom, lightweight storage engine written in Rust and designed for blockchain-like operations. The database is optimized around append-only segments, reorganization (reorg) support, and a sharded in-memory memtable for high insertion throughput. It includes:
- Sharded memtable for concurrent inserts
- Segment files for on-disk persistence
- B-Tree indexing for fast lookups in segments
- Reorg manager to roll back to a previous block height
- Configurable concurrency using a custom thread pool
## Table of Contents

- High-Level Overview
- Core Components
- On-Disk Structure
- Reorganization Flow
- Concurrency and Parallelism
- Memory, Flushing, and Segment Layout
- Indexing and Lookups
- Example Usage
- Low-Level Architecture Details
- License
## High-Level Overview

Op-Engine Database is a layered system:

- Sharded in-memory tables capture writes for fast insertion.
- When a shard grows beyond a configurable threshold (`memtable_size`), data is flushed to an on-disk segment.
- Each on-disk segment is accompanied by a B-Tree or similar index for rapid lookups without scanning entire files.
- Reorg logic allows rolling back recent writes if the underlying blockchain or ledger rewinds to a lower block height.
- All these pieces are brought together under a single `OpNetDB` instance, which can hold multiple "collections," each collection storing a distinct data model (e.g. `utxo`, `utxo_by_address`, etc.).
- A custom thread pool is used for concurrency, allowing parallel segment indexing, multi-shard insert parallelism, and more.

This structure is reminiscent of an LSM-tree style engine (in the sense of staged, sorted segments) but with extra *reorg* features, a simpler flush model, and custom concurrency.
## Core Components

### `OpNetDB`

The primary struct that application code interacts with is:

```rust
pub struct OpNetDB {
    pub config: DbConfig,
    pub thread_pool: Arc<Mutex<ThreadPool>>,
    pub sharded_tables: Arc<RwLock<HashMap<String, ShardedMemTable>>>,
    pub segment_manager: Arc<Mutex<SegmentManager>>,
    pub reorg_manager: Arc<Mutex<ReorgManager>>,
    pub collections: Arc<RwLock<HashMap<String, CollectionMetadata>>>,
}
```

- `DbConfig` holds paths, concurrency settings, memory thresholds, etc.
- A thread pool for background tasks and parallel index loading.
- A map of collection name → `ShardedMemTable` for the in-memory portion.
- A `SegmentManager` for on-disk segment creation, indexing, and rollbacks.
- A `ReorgManager` that knows the current block height and can revert changes.
- A collection registry so that each named collection can be typed and retrieved.

Lifecycle:

- `OpNetDB::new` initializes threads, sets up data paths, loads existing segments from disk, and configures the reorg state (height).
- `register_collection` adds a new logical collection name (e.g., `"utxo"`), allocating a new `ShardedMemTable`.
- `collection` returns a strongly-typed handle, `Collection<T>`, so you can `insert` or `get` typed records.
- `flush_all` forces all memtables to disk, checkpointing the writes.
- `reorg_to` allows rolling back segment files to a specified height.
### `Collection<T>`

A `Collection<T>` is a logical store for records of type `T`. Each type `T` implements:

- `KeyProvider` to define how to produce a byte-array key for the record.
- `CustomSerialize` to define how to serialize/deserialize the record bytes.

Hence, each `Collection<T>` is basically a typed key-value store where keys are derived from `T`. The main methods:

- `insert(record, block_height)`: store or update an object in the memtable. If the memtable is too large, flush it.
- `get(&key_args)`: get a record by key, looking first in the memtable, then on disk.

Internally, each `Collection` references the same `SegmentManager` for on-disk data but has its own entry in the `ShardedMemTable` map for in-memory writes. A sketch of how a record type might implement the two traits follows.
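For illustration, here is a minimal sketch of what a record type implementing these two traits could look like. The trait definitions below are assumptions written only so the snippet compiles on its own; the repository's actual `KeyProvider` and `CustomSerialize` signatures may differ, and the `Utxo` here is a reduced version of the one used in the example usage section.

```rust
use std::io::{Read, Result, Write};

// NOTE: these two trait definitions are assumed shapes of the engine's API,
// included only so the sketch is self-contained.
pub trait KeyProvider {
    type KeyArgs;
    fn compose_key(args: &Self::KeyArgs) -> Vec<u8>;
    fn primary_key(&self) -> Vec<u8>;
}

pub trait CustomSerialize: Sized {
    fn serialize<W: Write>(&self, w: &mut W) -> Result<()>;
    fn deserialize<R: Read>(r: &mut R) -> Result<Self>;
}

// A reduced Utxo record (only a few of the fields from the example usage).
pub struct Utxo {
    pub tx_id: [u8; 32],
    pub output_index: u32,
    pub amount: u64,
}

impl KeyProvider for Utxo {
    // Lookups use (tx_id, output_index), matching `get(&([0xab; 32], 0))` below.
    type KeyArgs = ([u8; 32], u32);

    fn compose_key(args: &Self::KeyArgs) -> Vec<u8> {
        let mut key = Vec::with_capacity(36);
        key.extend_from_slice(&args.0);
        key.extend_from_slice(&args.1.to_le_bytes());
        key
    }

    fn primary_key(&self) -> Vec<u8> {
        Self::compose_key(&(self.tx_id, self.output_index))
    }
}

impl CustomSerialize for Utxo {
    fn serialize<W: Write>(&self, w: &mut W) -> Result<()> {
        w.write_all(&self.tx_id)?;
        w.write_all(&self.output_index.to_le_bytes())?;
        w.write_all(&self.amount.to_le_bytes())
    }

    fn deserialize<R: Read>(r: &mut R) -> Result<Self> {
        let (mut tx_id, mut idx, mut amt) = ([0u8; 32], [0u8; 4], [0u8; 8]);
        r.read_exact(&mut tx_id)?;
        r.read_exact(&mut idx)?;
        r.read_exact(&mut amt)?;
        Ok(Utxo {
            tx_id,
            output_index: u32::from_le_bytes(idx),
            amount: u64::from_le_bytes(amt),
        })
    }
}
```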
### `ShardedMemTable`

The in-memory store for a collection uses multiple shards:

```rust
pub struct ShardedMemTable {
    shards: Vec<Mutex<MemTableShard>>,
    shard_count: usize,
    max_size: usize,
}
```

- Each shard is protected by a separate `Mutex<MemTableShard>`, so multiple threads can insert in parallel.
- The `shard_for_key` function chooses which shard to use, typically by hashing the key.
- `insert(key, value)` locks only that shard and updates the map, tracking approximate memory usage.
- If the total memory usage across shards exceeds `max_size`, we flush the entire sharded table to a new segment.

Why sharded? To reduce lock contention on large writes: instead of one global `Mutex`, we have `shard_count` locks, allowing parallel inserts to different shards (see the sketch below).
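To illustrate why per-shard locks help, here is a standalone sketch (not the engine's code) where several threads insert into a toy sharded map concurrently; only the shard that owns a key is locked for each insert:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

/// A toy sharded map: one mutex per shard instead of one global lock.
struct ShardedMap {
    shards: Vec<Mutex<HashMap<Vec<u8>, Vec<u8>>>>,
}

impl ShardedMap {
    fn new(shard_count: usize) -> Self {
        let shards = (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect();
        ShardedMap { shards }
    }

    /// Hash the key and reduce modulo the shard count, so a given key
    /// always lands in the same shard.
    fn shard_for_key(&self, key: &[u8]) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    /// Only the owning shard is locked, so inserts to other shards proceed in parallel.
    fn insert(&self, key: Vec<u8>, value: Vec<u8>) {
        let idx = self.shard_for_key(&key);
        self.shards[idx].lock().unwrap().insert(key, value);
    }
}

fn main() {
    let map = Arc::new(ShardedMap::new(8));
    let handles: Vec<_> = (0..4u8)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..1_000u32 {
                    map.insert(format!("t{t}-k{i}").into_bytes(), vec![t]);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total: usize = map.shards.iter().map(|s| s.lock().unwrap().len()).sum();
    println!("inserted {total} keys across {} shards", map.shards.len());
}
```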
### `SegmentManager`

When data is flushed from a memtable, the `SegmentManager`:

- Writes out a `.seg` file containing all key-value pairs from the sharded memtable in an append-like fashion.
- Builds a B-Tree index of `(key → file offset)` pairs, then writes it to a separate `.idx` file.
- Tracks each segment's `(start_height, end_height)`, plus its loaded index in memory.

Key tasks:

- `flush_sharded_memtable_to_segment`: takes all shard data, writes it to a new `.seg` file, builds the index, writes the `.idx`.
- `rollback_to_height(height)`: removes any segments whose `end_height > height`, physically deleting the `.seg` and `.idx` files.
- `find_value_for_key(collection_name, key)`: searches segments in reverse chronological order, looking up offsets in the B-Tree index, then reading the record from disk.

A sketch of the flush path is shown below.
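For illustration, the core of such a flush writes each pair in length-prefixed form while recording the offset at which it begins. This is a sketch only: the real `flush_sharded_memtable_to_segment` also handles file creation, buffering, and segment metadata, and the little-endian encoding here is an assumption.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};

/// Write all (key, value) pairs into `out` (standing in for a .seg file)
/// and build a key -> starting-offset index as we go.
fn flush_pairs<W: Write>(
    out: &mut W,
    pairs: &[(Vec<u8>, Vec<u8>)],
) -> io::Result<BTreeMap<Vec<u8>, u64>> {
    let mut index = BTreeMap::new();
    let mut offset: u64 = 0;
    for (key, value) in pairs {
        // Record where this entry begins before writing it.
        index.insert(key.clone(), offset);
        out.write_all(&(key.len() as u64).to_le_bytes())?;
        out.write_all(key)?;
        out.write_all(&(value.len() as u64).to_le_bytes())?;
        out.write_all(value)?;
        offset += 8 + key.len() as u64 + 8 + value.len() as u64;
    }
    Ok(index)
}

fn main() -> io::Result<()> {
    let pairs = vec![
        (b"key-a".to_vec(), b"value-a".to_vec()),
        (b"key-b".to_vec(), b"value-b".to_vec()),
    ];
    let mut seg = Vec::new(); // in-memory stand-in for the .seg file
    let index = flush_pairs(&mut seg, &pairs)?;
    println!("segment bytes: {}, index entries: {:?}", seg.len(), index);
    Ok(())
}
```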
### `ReorgManager`

Tracks the current chain height and allows "reorg to X" if needed. Typically:

- A reorg sets the internal `current_height` to `X`.
- Tells the `SegmentManager` to `rollback_to_height(X)`.
- Clears any in-memory shards so we don't have data above that height.
### Thread Pool

A simple custom thread pool providing:

- A fixed number of worker threads.
- Jobs submitted via `execute(|| { ... })`.
- A `TaskHandle<T>` to retrieve typed results or wait for completion.

Used for:

- Parallel loading of segment indexes on startup.
- Potential concurrency expansions like asynchronous flush or merges.

A standalone sketch of the `execute` / `TaskHandle<T>` pattern follows.
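The snippet below is not the repository's implementation, just a compact illustration of the pattern: the job runs on another thread, sends its typed result over a channel, and the handle blocks until the result arrives. A real pool would reuse a fixed set of worker threads rather than spawning one per job, and the `wait` method name is an assumption.

```rust
use std::sync::mpsc;
use std::thread;

/// Minimal stand-in for a typed task handle: waits for the job's result.
struct TaskHandle<T> {
    rx: mpsc::Receiver<T>,
}

impl<T> TaskHandle<T> {
    /// Block until the job has finished and return its result.
    fn wait(self) -> T {
        self.rx.recv().expect("worker dropped without sending a result")
    }
}

/// Run a job on another thread and hand back a typed handle.
fn execute<T, F>(job: F) -> TaskHandle<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(job());
    });
    TaskHandle { rx }
}

fn main() {
    // e.g. compute something (or load a segment index) in the background.
    let handle = execute(|| (0u64..1_000).sum::<u64>());
    println!("background result = {}", handle.wait());
}
```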
## On-Disk Structure

- Data files (`.seg`): contain raw `(key, value)` pairs, appended one after another. Each pair is stored as:
  - `u64` length of the key, followed by the key bytes
  - `u64` length of the value, followed by the value bytes
- Index files (`.idx`): a B-Tree or sorted structure mapping `key → offset-in-.seg-file`, read into memory for quick lookups.
- The file naming format is typically `{collection_name}_{startBlock}_{endBlock}.seg` and `{collection_name}_{startBlock}_{endBlock}.idx`, sometimes with an added numeric suffix if the same `(startBlock, endBlock)` pair is reused.

Hence, each flush yields a new pair of `.seg` and `.idx` files. A sketch of reading a record back at a known offset follows.
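Given an offset from the index, a reader can seek to it and decode the length-prefixed pair. The following is a minimal sketch using an in-memory cursor instead of a real `.seg` file; little-endian lengths are an assumption.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Read the (key, value) pair that starts at `offset` in a .seg-style stream.
fn read_pair_at<R: Read + Seek>(r: &mut R, offset: u64) -> io::Result<(Vec<u8>, Vec<u8>)> {
    r.seek(SeekFrom::Start(offset))?;
    let mut len = [0u8; 8];

    r.read_exact(&mut len)?;
    let mut key = vec![0u8; u64::from_le_bytes(len) as usize];
    r.read_exact(&mut key)?;

    r.read_exact(&mut len)?;
    let mut value = vec![0u8; u64::from_le_bytes(len) as usize];
    r.read_exact(&mut value)?;

    Ok((key, value))
}

fn main() -> io::Result<()> {
    // Hand-encode one pair ("k" -> "v") in the documented layout:
    // u64 key length, key bytes, u64 value length, value bytes.
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&1u64.to_le_bytes());
    bytes.push(b'k');
    bytes.extend_from_slice(&1u64.to_le_bytes());
    bytes.push(b'v');

    let (key, value) = read_pair_at(&mut Cursor::new(bytes), 0)?;
    assert_eq!((key, value), (vec![b'k'], vec![b'v']));
    Ok(())
}
```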
## Reorganization Flow

A typical block pipeline is:

- A new block at height `H` arrives; you perform several inserts into `Collection<T>` with `block_height = H`.
- If the sharded memtable grows too large, the system flushes. The resulting `.seg` and `.idx` are labeled with `[start=H, end=H]`.
- If a chain reorg happens and you must drop blocks above `X`, call `reorg_to(X)`:
  - The `SegmentManager` deletes any segment with `end_height > X`.
  - The sharded memtables are cleared from memory to remove uncommitted data.
## Concurrency and Parallelism

- Concurrent inserts: each `Collection::insert` picks a shard by hashing the key. Only that shard's mutex is locked, allowing multiple threads to insert different keys concurrently.
- Parallel segment indexing: on startup, each segment's `.idx` file is loaded in parallel using the global thread pool.
- Multi-thread flush: the code can be extended to write different shards in parallel. Currently, the flush is done under a single lock in `SegmentManager`, but indexing or writing can be further parallelized.
## Memory, Flushing, and Segment Layout

Each collection has a `memtable_size`. Once the total size of the data across all shards in the `ShardedMemTable` exceeds that threshold:

- We call `SegmentManager::flush_sharded_memtable_to_segment(...)`.
- The entire memtable for that collection is written out to a `.seg` file.
- A new B-Tree index is built in memory and then written out as `.idx`.
- The in-memory memtable is cleared.

Why flush the entire memtable? We currently do a "full flush," which is simpler than a partial flush. An alternative approach might flush one shard at a time, or a fraction of the data, but flushing one big chunk reduces overhead. A sketch of the size accounting and flush trigger follows.
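For illustration, the size accounting and full-flush trigger could look like the toy below. This is a sketch under assumed names; the real `ShardedMemTable` tracks its size differently in detail and flushes through the `SegmentManager` rather than returning pairs to the caller.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Toy sharded memtable that tracks an approximate total byte size and
/// signals when a full flush should happen.
struct ShardedMemTable {
    shards: Vec<Mutex<HashMap<Vec<u8>, Vec<u8>>>>,
    approx_size: AtomicUsize,
    max_size: usize,
}

impl ShardedMemTable {
    /// Insert into one shard; returns true when the caller should flush.
    fn insert(&self, shard_idx: usize, key: Vec<u8>, value: Vec<u8>) -> bool {
        let added = key.len() + value.len();
        self.shards[shard_idx].lock().unwrap().insert(key, value);
        // Approximate accounting: count bytes added; a real implementation
        // would also subtract the size of any overwritten value.
        let total = self.approx_size.fetch_add(added, Ordering::Relaxed) + added;
        total > self.max_size
    }

    /// "Full flush": drain every shard into one list of pairs and reset the counter.
    fn drain_all(&self) -> Vec<(Vec<u8>, Vec<u8>)> {
        let mut out = Vec::new();
        for shard in &self.shards {
            out.extend(shard.lock().unwrap().drain());
        }
        self.approx_size.store(0, Ordering::Relaxed);
        out
    }
}

fn main() {
    let mt = ShardedMemTable {
        shards: (0..4).map(|_| Mutex::new(HashMap::new())).collect(),
        approx_size: AtomicUsize::new(0),
        max_size: 64,
    };
    let mut should_flush = false;
    for i in 0..10u8 {
        should_flush |= mt.insert((i % 4) as usize, vec![i; 8], vec![i; 8]);
    }
    if should_flush {
        let pairs = mt.drain_all();
        println!("flushed {} pairs", pairs.len());
    }
}
```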
## Indexing and Lookups

Each `.idx` file stores a B-Tree (`BTreeIndex`) mapping `key → offset in .seg`. For example:

```
index: key=someKey -> offset=1234
```

During lookup:

- We search from the newest segment to the oldest, because the newest segment has the most recent data.
- If the key is in that segment's B-Tree, we read the record at the stored offset from disk.
- We verify that the key at that offset matches exactly, then return the value.

We also optionally allow scanning a range of keys (e.g., `[start_key .. end_key]`) with a limit; the scan merges results from newest to oldest segments, deduplicates, and returns up to that many matches (see the sketch below).
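To illustrate the newest-first merge with deduplication, here is a standalone sketch that uses in-memory `BTreeMap`s as stand-ins for per-segment indexes; the function and variable names are illustrative, not the engine's:

```rust
use std::collections::{BTreeMap, HashSet};
use std::ops::Bound::Included;

/// Scan `[start_key ..= end_key]` across segment indexes ordered newest first.
/// The first segment that contains a key wins; older duplicates are skipped.
fn range_scan(
    segments_newest_first: &[BTreeMap<Vec<u8>, u64>],
    start_key: &[u8],
    end_key: &[u8],
    limit: usize,
) -> Vec<(Vec<u8>, u64)> {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut results = Vec::new();
    for index in segments_newest_first {
        let range = (Included(start_key.to_vec()), Included(end_key.to_vec()));
        for (key, offset) in index.range(range) {
            if results.len() >= limit {
                return results;
            }
            if seen.insert(key.clone()) {
                // Newest occurrence of this key; in the real engine the offset
                // would now be used to read the value from the .seg file.
                results.push((key.clone(), *offset));
            }
        }
    }
    results
}

fn main() {
    let mut newest = BTreeMap::new();
    newest.insert(b"b".to_vec(), 200u64);
    let mut oldest = BTreeMap::new();
    oldest.insert(b"a".to_vec(), 10u64);
    oldest.insert(b"b".to_vec(), 20u64); // shadowed by the newer segment
    let hits = range_scan(&[newest, oldest], b"a", b"z", 10);
    println!("{hits:?}"); // prints the byte keys: [([98], 200), ([97], 10)]
}
```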
## Example Usage

Below is a simplified usage snippet:

```rust
fn main() {
    // Create config
    let config = DbConfig {
        data_path: "mydata/".into(),
        wal_path: "mydata/wal/".into(),
        num_threads: 4,
        memtable_size: 1024 * 1024,
        height: 100,
    };

    // Instantiate database
    let db = OpNetDB::new(config).expect("Failed to init DB");

    // Register a collection
    db.register_collection("utxo").expect("Collection error");

    // Retrieve typed handle
    let utxo_coll = db.collection::<Utxo>("utxo").expect("Get coll error");

    // Insert a UTXO record
    let my_utxo = Utxo {
        tx_id: [0xab; 32],
        output_index: 0,
        address: [0xcd; 33],
        amount: 10_000,
        script_pubkey: vec![0xAA, 0xBB, 0xCC],
        deleted_at_block: None,
    };
    utxo_coll.insert(my_utxo, 101).expect("insert fail");

    // Flush
    db.flush_all(101).expect("flush fail");

    // Lookup
    let found = utxo_coll.get(&([0xab; 32], 0)).unwrap();
    println!("Found => {:?}", found);

    // Reorg
    db.reorg_to(100).expect("reorg fail");
}
```
## Low-Level Architecture Details

Below is a step-by-step breakdown of the critical paths in the engine, from insertion to on-disk layout:

1. Insertion (`Collection<T>::insert(record, block_height)`):
   - Serialize `record` using `CustomSerialize`.
   - Find the shard to place the key in (`hash(key) % shard_count`).
   - Insert `(key, serialized_value)` into the `MemTableShard`.
   - If the total size exceeds `memtable_size`, flush.

2. Flush (`SegmentManager::flush_sharded_memtable_to_segment`):
   - Create a `.seg` file with a large buffered writer.
   - Iterate over each shard:
     - For each `(key, value)`, write them in raw form (`[length, key bytes, length, value bytes]`).
     - Build a B-Tree (or in-memory index) mapping `key → offset_in_file`.
   - Write the B-Tree out to a `.idx` file.
   - Append the new `SegmentMetadata` to the `segments` list in memory.
   - Clear the sharded memtable.

3. Query (`Collection<T>::get(key_args)`):
   - Convert `key_args` to a byte array (`KeyProvider::compose_key`).
   - Look in the memtable for that key. If found, deserialize and return.
   - If not found in the memtable, search from the newest segment to the oldest:
     - Use the in-memory B-Tree index to see if the key has an offset.
     - If an offset is found, open the `.seg` file, read the key-value pair at that offset, verify the key, and return the value.

4. Reorg (`OpNetDB::reorg_to(height)`):
   - `ReorgManager` sets `current_height` to `height`.
   - `SegmentManager::rollback_to_height(height)` removes segments above this height and deletes their `.seg` + `.idx` files from disk.
   - Clear the in-memory shards to discard unflushed data beyond the old height.

A sketch of this reorg path is shown below.
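To tie the reorg steps together, here is a compact sketch over toy types; the struct layouts and field names are assumptions, not the repository's actual code:

```rust
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;
use std::sync::Mutex;

struct SegmentMetadata {
    seg_path: PathBuf,
    idx_path: PathBuf,
    end_height: u64,
}

struct Engine {
    current_height: Mutex<u64>,
    segments: Mutex<Vec<SegmentMetadata>>,
    // collection name -> in-memory key/value pairs awaiting flush
    memtables: Mutex<HashMap<String, HashMap<Vec<u8>, Vec<u8>>>>,
}

impl Engine {
    /// Roll back to `height`: update the tracked height, drop newer segments
    /// (and their files), and clear unflushed in-memory data.
    fn reorg_to(&self, height: u64) -> std::io::Result<()> {
        *self.current_height.lock().unwrap() = height;

        let mut segments = self.segments.lock().unwrap();
        let (keep, removed): (Vec<_>, Vec<_>) =
            segments.drain(..).partition(|s| s.end_height <= height);
        for seg in removed {
            fs::remove_file(&seg.seg_path)?;
            fs::remove_file(&seg.idx_path)?;
        }
        *segments = keep;

        // Unflushed writes may belong to heights above `height`, so clear them all.
        self.memtables.lock().unwrap().values_mut().for_each(HashMap::clear);
        Ok(())
    }
}
```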
## License

This repository is licensed under the MIT License (or your chosen license). Feel free to use, modify, and extend this code in your projects!