Testnet node database design #126

bobbinth · 2023-05-24T08:58:23Z

bobbinth
May 24, 2023
Maintainer

Discussing RPC endpoints in #121 made me think about the backend needed to support them. Here are my preliminary thoughts on this.

Overall, the node needs to manage 4 separate data sets: accounts, notes, nullifiers, and blocks. I've tried to keep the design as simple as possible (i.e., not create a sophisticated relational database) and also rely on purely in-memory structures in some cases. I think this is fine for testnet purposes and we can optimize things later. Let's go through these one-by-one.

Account DB

Account database keeps track of the latest state of accounts and facilitates block production/verification. In my mind, it consists of two parts:

1. A flat key-value store which maps an account ID to account data. For private accounts, account data would contain just the hash, but for public accounts, it would contain data required to instantiate the Account object. For testnet, I'll assume that we always load a full account object into memory. Beyond testnet, we might need to implement "lazy-loading" for the store portion of account storage.
2. A sparse Merkle tree which is used to compute commitments to the account database and assist with generating Merkle proofs for account states. This would be a variation on our TieredSmt, with the main difference being that keys in this TSMT would be 64-bit values (need to come up with a good name for this data structure).

The flat key-value map would be persisted in a database (or on disk), but the Merkle tree would live solely in memory, and for testnet purposes, it would be built every time the node starts up. Later we can optimize this, but I think for testnet (and maybe even beyond) this should be sufficient.
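As a rough illustration of the "persist the flat map, rebuild the tree on start-up" idea, here is a minimal sketch. It is not TieredSmt: it assumes a plain depth-64 sparse Merkle tree over 64-bit account IDs, uses SHA-256 as a stand-in for the node's real hash function, and models the persisted key-value map as a Python dict. The names `AccountTree` and `rebuild_tree` are hypothetical.

```python
import hashlib

DEPTH = 64  # keys are 64-bit account IDs, as in the proposal


def h(left: bytes, right: bytes) -> bytes:
    # SHA-256 is a stand-in for the node's real hash function (assumption).
    return hashlib.sha256(left + right).digest()


# EMPTY[d] is the root of an empty subtree of height d (EMPTY[0] is an empty leaf).
EMPTY = [bytes(32)]
for _ in range(DEPTH):
    EMPTY.append(h(EMPTY[-1], EMPTY[-1]))


class AccountTree:
    """Sparse Merkle tree that stores only non-empty nodes, fully in memory."""

    def __init__(self):
        self.nodes = {}  # (height, index) -> node hash

    def _get(self, height, index):
        return self.nodes.get((height, index), EMPTY[height])

    def root(self):
        return self._get(DEPTH, 0)

    def update(self, account_id: int, account_hash: bytes):
        index = account_id
        self.nodes[(0, index)] = account_hash
        # Recompute every node on the path from the leaf up to the root.
        for height in range(DEPTH):
            node = self._get(height, index)
            sibling = self._get(height, index ^ 1)
            pair = (node, sibling) if index % 2 == 0 else (sibling, node)
            index //= 2
            self.nodes[(height + 1, index)] = h(*pair)

    def prove(self, account_id: int):
        """Return the sibling hashes from leaf level up to just below the root."""
        index, path = account_id, []
        for height in range(DEPTH):
            path.append(self._get(height, index ^ 1))
            index //= 2
        return path


def verify(root, account_id, account_hash, path):
    node, index = account_hash, account_id
    for sibling in path:
        node = h(node, sibling) if index % 2 == 0 else h(sibling, node)
        index //= 2
    return node == root


def rebuild_tree(store: dict) -> AccountTree:
    """Start-up step: replay the persisted account_id -> hash map into a fresh tree."""
    tree = AccountTree()
    for account_id, account_hash in store.items():
        tree.update(account_id, account_hash)
    return tree
```

Because the root depends only on the set of (account ID, hash) pairs, rebuilding from the persisted map in any order yields the same commitment.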

Note DB

Note database keeps track of all notes ever created. In my mind it consists of 3 parts:

1. Block note data - this would store notes created in every block in a format similar to the one described in Testnet RPC Requirements #121 (reply in thread).
2. Public note data - a flat key-value map mapping note hashes to full public note details (i.e., script, vault etc.).
3. Block tag index - keeping track of min/max tag for each block.

The first two of the above components would be persisted in a database (or on disk), but the 3rd component would live solely in memory (and would be built on node start-up). It would be just a simple vector storing (min_tag, max_tag) for each block. This index would be used to assist get_notes_by_tag RPC endpoint.
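A minimal sketch of how such a block tag index could be built and queried. It assumes tags are plain integers and that each block's notes are available as a list of tags; `build_tag_index` and `candidate_blocks` are hypothetical names, not existing node functions.

```python
def build_tag_index(blocks):
    """blocks[n] is the list of note tags created in block n.

    Returns one (min_tag, max_tag) entry per block, or None for a block
    that created no notes.
    """
    return [(min(tags), max(tags)) if tags else None for tags in blocks]


def candidate_blocks(index, tag, from_block=0):
    """Block numbers whose [min_tag, max_tag] range could contain `tag`."""
    return [
        n
        for n in range(from_block, len(index))
        if index[n] is not None and index[n][0] <= tag <= index[n][1]
    ]
```

The index only narrows the search: a block whose range covers the tag still has to be scanned, since the tag may not actually occur in it.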

We could also build an index to support mapping note_hash |-> (block_num, index) if we want to provide get_note_by_hash RPC endpoint.

For testnet, we won't implement any note data pruning and would assume that nothing gets deleted from note databases.

Nullifier DB

Nullifier database stores nullifiers of consumed notes. For testnet, we won't implement epoch-based nullifiers and will assume that there is a single nullifier database. Similar to the account database, this database would consist of two components:

1. A flat key-value map mapping nullifier to the block number in which the nullifier was created.
2. A sparse Merkle tree used to compute commitment to the nullifier database and assist with generating Merkle proofs for nullifiers. For testnet, this would be our TieredSmt.

The first component would be persisted in a database (or on disk), while the second component would live solely in memory and would be built on node start-up.

Block DB

This database would keep track of all produced blocks. In my mind, it consists of the following components:

1. A flat key-value map mapping block_num |-> block_header.
2. A flat key-value map mapping block_num |-> block_data where block data includes:
   a. A list of nullifiers created in this block.
   b. A list of (account_id, account_hash) tuples for all accounts updated in this block. For public accounts, we'd also need to store state/vault deltas.
   c. Potentially a list of transaction hashes for transactions executed in this block.
3. A flat key-value map mapping block_num |-> proof.
4. An MMR consisting of block headers.

All of the above, except for the MMR, would be persisted in a database (or on disk). The MMR would be our Mmr struct and it would live fully in memory (and be instantiated on node start-up).
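To make the "instantiate the MMR on start-up" step concrete, here is a small sketch of a Merkle Mountain Range that keeps only its peaks. This is not the Mmr struct: it assumes SHA-256 instead of the real hash, stores no inner nodes (so it cannot produce proofs), and the way the peaks are folded into one commitment is an arbitrary choice made for illustration. `PeaksOnlyMmr` and `rebuild_mmr` are hypothetical names.

```python
import hashlib


def merge(left: bytes, right: bytes) -> bytes:
    # SHA-256 is a stand-in for the real hash function (assumption).
    return hashlib.sha256(left + right).digest()


class PeaksOnlyMmr:
    def __init__(self):
        self.peaks = []  # (height, hash) pairs, tallest peak first
        self.num_leaves = 0

    def append(self, leaf: bytes) -> None:
        height, node = 0, leaf
        # Two peaks of equal height merge into one peak that is one level taller.
        while self.peaks and self.peaks[-1][0] == height:
            _, left = self.peaks.pop()
            node = merge(left, node)
            height += 1
        self.peaks.append((height, node))
        self.num_leaves += 1

    def commitment(self) -> bytes:
        # Illustrative only: fold the leaf count and the peaks into one hash.
        acc = hashlib.sha256(self.num_leaves.to_bytes(8, "little")).digest()
        for _, peak in self.peaks:
            acc = merge(acc, peak)
        return acc


def rebuild_mmr(header_hashes):
    """Start-up step: replay persisted block header hashes in block_num order."""
    mmr = PeaksOnlyMmr()
    for header_hash in header_hashes:
        mmr.append(header_hash)
    return mmr
```

Since block numbers grow without gaps, replaying the persisted headers in order is enough to reproduce the same peaks after a restart.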

Maybe using a key-value map for the above is overkill, since all of the data is indexed by block_num, which grows monotonically and without gaps.

Also, since all of them use block_num as the key, there could be an argument for combining them into a single object. But I think it might make sense to keep them separate as we may want to prune different components differently (i.e., proofs can be discarded much sooner as compared to block header data).

frisitano · 2023-05-24T11:50:36Z

frisitano
May 24, 2023

> The flat key-value map would be persisted in a database (or on disk), but the Merkle tree would live solely in memory, and for testnet purposes, it would be built every time the node starts up. Later we can optimize this, but I think for testnet (and maybe even beyond) this should be sufficient.

I like the idea of keeping the account Merkle tree in memory for testnet! This will make things much more practical and achievable. I think we should do some benchmarks to understand the implications of this implementation. Specifically:

- determine the memory consumption of the tree at different capacities - e.g. 1000 / 10000 / 100000 accounts etc.
- determine the time it takes to instantiate the tree from cold based on tree capacity - e.g. 1000 / 10000 / 100000 etc.

We should define some reference hardware to run these benchmarks.
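A benchmark along these lines could start from a harness like the one below. It is only a sketch of the measurement loop: `build_tree` is a hypothetical stand-in (a plain binary Merkle tree over SHA-256 leaves), so the numbers it produces say nothing about the real tree, and a real benchmark would call the actual implementation on the agreed reference hardware.

```python
import hashlib
import time
import tracemalloc


def build_tree(num_accounts):
    """Stand-in for the real tree: a binary Merkle tree over num_accounts leaves."""
    level = [hashlib.sha256(i.to_bytes(8, "little")).digest() for i in range(num_accounts)]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [bytes(32)]  # pad odd levels with an empty node
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels


def benchmark(num_accounts):
    # Time the build without tracing, since tracemalloc slows allocation down.
    start = time.perf_counter()
    build_tree(num_accounts)
    seconds = time.perf_counter() - start

    # Measure peak memory in a separate run.
    tracemalloc.start()
    tree = build_tree(num_accounts)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {"accounts": num_accounts, "seconds": seconds, "peak_bytes": peak_bytes,
            "root": tree[-1][0]}
```

Running `benchmark(n)` for n in 1000, 10000, 100000 gives one row per capacity, which is the shape of table the two bullet points above ask for.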

> The first two of the above components would be persisted in a database (or on disk), but the 3rd component would live solely in memory (and would be built on node start-up). It would be just a simple vector storing (min_tag, max_tag) for each block. This index would be used to assist get_notes_by_tag RPC endpoint.

Can you elaborate on how this note implementation supports the get_notes_by_tag endpoint? It looks like this would be vulnerable to a relatively simple attack - a malicious actor creates 2 notes per block, one with tag = 0 and the other with tag = Felt::MAX, which makes every block's (min_tag, max_tag) range cover every possible tag. Is the intention to iterate over all blocks to see if the target tag exists inside the range (min_tag, max_tag)? Maybe some type of bloom filter would be more appropriate (however I think @hackaugusto has some valid reservations about this)?

An alternative mechanic would be to introduce the concept of "note epochs". A note epoch would define some period of time (or number of blocks), let's say it's 7 days. We could then have a database key-value map of hash(tag | epoch_number) -> Vec<NoteOrigin>. A client would then need to sync every 7 days via a single RPC request. We could tune the note epoch length to make it more optimal. Maybe daily epochs would be better.
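The note-epoch map could look roughly like this. The sketch assumes epochs are measured in blocks rather than days, uses SHA-256 for the key, models NoteOrigin as a (block_num, note_index) tuple, and uses an in-memory dict where the proposal has a database table. `EPOCH_LEN` and `EpochIndex` are hypothetical.

```python
import hashlib
from collections import defaultdict

EPOCH_LEN = 100  # blocks per note epoch (hypothetical value)


def epoch_key(tag: int, epoch: int) -> bytes:
    """Key of the form hash(tag | epoch_number)."""
    return hashlib.sha256(tag.to_bytes(8, "little") + epoch.to_bytes(8, "little")).digest()


class EpochIndex:
    def __init__(self):
        self.table = defaultdict(list)  # epoch_key -> [(block_num, note_index)]

    def add_note(self, tag: int, block_num: int, note_index: int) -> None:
        epoch = block_num // EPOCH_LEN
        self.table[epoch_key(tag, epoch)].append((block_num, note_index))

    def get_notes(self, tag: int, epoch: int):
        """One lookup per (tag, epoch); returns a copy of the stored origins."""
        return list(self.table.get(epoch_key(tag, epoch), []))
```

In this layout the cost of a lookup depends only on how many notes share the requested tag within the epoch, so notes with extreme tag values created by someone else do not widen the search.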

8 replies

frisitano May 24, 2023

A relatively large number of false positives is actually desirable here for privacy reasons. For example, we may want to get 10x or maybe even 100x of what we actually want so that the node serving the request cannot learn which exact notes we are interested in.

I think a bloom filter would be well suited for this purpose.

With bloom filters, is there a way to index somehow? Or do we need to run the entire dataset through a bloom filter to figure out what matches?

We would update the bloom filter on the fly as new notes are produced. We seal a bloom filter periodically and start a new one. We have two main parameters that we can tweak to support our privacy goals: the size of the note set per bloom filter and the size of the bloom filter.

We could then index using a map of hash(bloom_filter_number, bloom_filter_target) -> Vec<NoteOrigin>.

bobbinth May 24, 2023
Maintainer Author

> We would update the bloom filter on the fly as new notes are produced. We seal a bloom filter periodically and start a new one.

hmmm - maybe I'm not understanding the construction. I was imagining that a bloom filter would be provided with each get_notes_by_tag request, and then we'd need to figure out which notes match this bloom filter. But seems like that's not how you were thinking about it?

frisitano May 24, 2023

So my understanding of how it could potentially work is as follows. You wouldn't want to have a single bloom filter for all historical notes ever created, as the bloom filter would then become saturated. Instead, we should group notes into sets and assign each set its own bloom filter. A set could be defined by a specific number of blocks - let's say 500 - so each set of notes created in a 500-block window would be assigned a bloom filter. Say, as a client, I last synced 2000 blocks ago and I need to get the notes associated with a tag I'm interested in. I would calculate the bloom filter associated with my tag: bloom_target = calculate_bloom_filter(tag). I would then send a request to the RPC, which would be get_notes_by_bloom_filter(target=bloom_target, last_sync=2000). The RPC server would then check the 4 most recent bloom filters to see if any of them have a match with the bloom_target. If we find a match, we would then look up the items associated with the bloom_target in the key-value database. The bloom target table would map hash(bloom_target | bloom_filter_number) -> Vec<NoteMetadata>, where the vector holds all of the notes associated with the bloom_target for the bloom_filter_number of interest.
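One possible reading of this construction, as a sketch: the server keeps one bloom filter per 500-block window, the client derives a "bloom target" (the bit positions its tag would set) and sends that instead of the tag, and the server consults the per-window filter before doing the table lookup. The filter size `M`, the number of hash functions `K`, the use of SHA-256, and the names `bloom_target` and `WindowedFilters` are all assumptions made for illustration.

```python
import hashlib

WINDOW = 500  # blocks per bloom filter, as in the example above
M = 1024      # bits per filter (hypothetical)
K = 3         # hash functions per tag (hypothetical)


def bloom_target(tag: int) -> tuple:
    """Bit positions set by `tag`; this is what the client would send."""
    return tuple(
        int.from_bytes(
            hashlib.sha256(bytes([i]) + tag.to_bytes(8, "little")).digest()[:8], "little"
        ) % M
        for i in range(K)
    )


class WindowedFilters:
    def __init__(self):
        self.filters = {}  # window number -> bitset stored as an int
        self.table = {}    # (window number, target) -> [(block_num, note_index)]

    def add_note(self, tag: int, block_num: int, note_index: int) -> None:
        window, target = block_num // WINDOW, bloom_target(tag)
        bits = self.filters.get(window, 0)
        for pos in target:
            bits |= 1 << pos  # insertion sets K bits
        self.filters[window] = bits
        self.table.setdefault((window, target), []).append((block_num, note_index))

    def might_contain(self, window: int, target: tuple) -> bool:
        bits = self.filters.get(window, 0)
        return all((bits >> pos) & 1 for pos in target)  # membership checks K bits

    def get_notes(self, target: tuple, first_block: int, last_block: int):
        notes = []
        for window in range(first_block // WINDOW, last_block // WINDOW + 1):
            if self.might_contain(window, target):
                notes.extend(self.table.get((window, target), []))
        return notes
```

A client that last synced 2000 blocks ago would cause 4 windows to be checked when the range is aligned to window boundaries. A false positive in a filter only costs one extra table lookup in this sketch.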

bobbinth May 28, 2023
Maintainer Author

> The RPC server would then check the 4 most recent bloom filters to see if any of them have a match with the bloom_target.

This would imply that we need to figure out if two bloom filters have a non-empty intersection set, right? My knowledge of bloom filters is very limited - so, I'm not sure how easy/efficient this is to do. If it works, this could be a great solution though.

If the above proves to be too complicated/inefficient, we could probably devise simpler ways to build indexes for building partial matches on note tags.

frisitano May 29, 2023

> This would imply that we need to figure out if two bloom filters have a non-empty intersection set, right? My knowledge of bloom filters is very limited - so, I'm not sure how easy/efficient this is to do. If it works, this could be a great solution though.

Yes that is correct, we would need to check for non-empty intersection but this should be very efficient - a single database lookup. I think using a bloom filter is an ideal solution for the problem at hand as it gives us a number of tuneable parameters that allow us to target the differential privacy / false positive rates that we desire.

> If the above proves to be too complicated/inefficient, we could probably devise simpler ways to build indexes for building partial matches on note tags.

The primary computational overhead is the computation of the k hash functions for each note_tag - both insertion and membership checking are O(k). However, hash function evaluation can be parallelized and we would use fast hash functions. I would certainly agree that matching on partial note tags is a simpler and more efficient solution; however, it does leak some information, which is undesirable.

bobbinth · 2023-05-27T18:01:57Z

bobbinth
May 27, 2023
Maintainer Author

To implement the data store we'd need to pick some persistent database to store the data (I'm discounting such options as using a file system directly or writing our own database as infeasible). Assuming we go with Rust, I did some cursory research of available options, and here is a summary:

Rust-native databases

I came across a few Rust-native databases, the most interesting of which are:

- sled - looks very nice and pretty widely used (300K+ recent downloads, 7K+ stars on Github), but unfortunately it doesn't seem like it is being maintained any more.
- redb - seems to be very promising with nice APIs - but fairly new (12K recent downloads, 1.6K stars on Github), and not yet stable.

Embedded databases with Rust bindings

- LMDB - solid, battle-tested, and very widely used database, but Rust bindings are either unmaintained (e.g., lmdb) or are not stable enough yet (e.g., heed).
- libmdbx - seems to be an improvement over LMDB (I believe Erigon and reth projects use it). Rust bindings (libmdbx-rs) seem to be well-maintained and relatively widely used (75K+ recent downloads), but there seems to be only one maintainer and documentation for the bindings is sparse.
- LevelDB - a widely used database used by Geth and Bitcoin. Rust bindings (leveldb) seem to be unmaintained though.
- RocksDB - a widely used database developed by Facebook (also used by Sui, Aptos and probably others). Rust bindings (rocksdb) are well-maintained and documented. They also seem to be quite popular and widely used (780K+ recent downloads, 1.5K+ stars on Github).
- SQLite - one of the most popular embedded databases in the world. Rust bindings (rusqlite) are well documented, maintained, and seem to be very widely used (1.2M+ recent downloads, 2.1K stars on Github). But it is somewhat slower than other options (because it is a full-fledged relational database).

Client-server databases

PostgreSQL - probably one of the best client-server databases. Hermez team is using this for zkEVM. Rust bindings (Rust-Postgres) seem to be well-maintained, well-documented, and widely used (340K+ recent downloads, 2.9K stars on Github).

Viable options

As much as I would love to use a Rust-native database, I don't think the options we have now are compelling (unless I missed something of course). I also don't think we should go with a client-server database at this point to avoid the complexity of dealing with a separate database server.

This leaves embedded databases with Rust bindings. Here, I think we care more about stable, well-maintained and well-documented options, which in my mind narrows things down to either RocksDB or SQLite. And at this point, I'm leaning more towards SQLite - though we should think through all pros and cons.

Also, a very interesting post from Erigon: Choice of storage engine.

5 replies

hackaugusto May 28, 2023

IMO SQLite is a solid option. It is widely deployed. It is one of the best tested OSS projects out there. It has bindings to lots of languages, which makes portability easy (e.g. a standalone synchronization program that outputs a sqlite file, that can be used by other applications). And using SQL from the start makes using a proper DB in the future easier to implement, if the need ever arises.

Using a KV-store is an option, but it would make our lives a little bit harder when RPC endpoints that scan data are added (e.g. getting all the notes generated by an on-chain account, getting notes for a specific range, etc.), since we would need to build the iteration ourselves. (It is definitely doable, for example Spanner/CockroachDB built distributed SQL on top of a KV store. I'm just saying it will be more work in direct comparison to SQLite.)

The only downside is that it has limited support for concurrent writes (ref). It should not be a big deal if the application is designed properly, and if it becomes a bottleneck it is possible to use multiple DBs to increase concurrency.
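To illustrate the kind of scan queries that come for free with SQL, here is a small sketch using Python's built-in sqlite3 module. The table layout and column names are hypothetical and not a proposed schema.

```python
import sqlite3


def open_note_db(path=":memory:"):
    conn = sqlite3.connect(path)
    # Hypothetical schema. SQLite INTEGER is a signed 64-bit value, so tags or
    # IDs above 2**63 - 1 would need a different encoding (for example BLOB).
    conn.execute(
        """CREATE TABLE IF NOT EXISTS notes (
               block_num  INTEGER NOT NULL,
               note_index INTEGER NOT NULL,
               note_hash  BLOB    NOT NULL,
               sender     INTEGER NOT NULL,
               tag        INTEGER NOT NULL,
               PRIMARY KEY (block_num, note_index)
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS notes_by_sender ON notes (sender, block_num)")
    return conn


def insert_note(conn, block_num, note_index, note_hash, sender, tag):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO notes VALUES (?, ?, ?, ?, ?)",
            (block_num, note_index, note_hash, sender, tag),
        )


def notes_in_block_range(conn, first_block, last_block):
    """Range scan over the primary key, with no hand-written iteration."""
    return conn.execute(
        "SELECT block_num, note_index FROM notes "
        "WHERE block_num BETWEEN ? AND ? ORDER BY block_num, note_index",
        (first_block, last_block),
    ).fetchall()


def notes_by_sender(conn, sender):
    """All notes generated by one account, served by the secondary index."""
    return conn.execute(
        "SELECT block_num, note_index FROM notes "
        "WHERE sender = ? ORDER BY block_num, note_index",
        (sender,),
    ).fetchall()
```

With a plain KV store, each of these two queries would need its own key layout or a manually maintained secondary index.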

frisitano May 29, 2023

Another Rust-native database is parity-db, which is used in Substrate as a replacement for RocksDB, which was being used previously. I believe the main motivation for the migration is that RocksDB doesn't support reference counting. However, parity-db is also relatively new and depends on the Parity team for continued support, which is undesirable.

I think that a relational database (with its rich feature set) could make our lives easier in certain circumstances - however I'm not sure if the performance cost is justified, especially in a blockchain system in which scalability is foundational. To evaluate this objectively we should try and define our requirements on the database. We can then review them for our options and see which is best suited.

Typically the database becomes the bottleneck in blockchain systems so I think it's important to get this right - see this talk by the lead developer of Geth for more insight. In the short term I don't think this will be much of a problem as we can fit state in memory, but this probably won't be the case forever.

bobbinth May 29, 2023
Maintainer Author

> I think that a relational database (with its rich feature set) could make our lives easier in certain circumstances - however I'm not sure if the performance cost is justified, especially in a blockchain system in which scalability is foundational.

Agreed, performance is one of my main concerns here, and we should definitely do more research. But a couple of things make me think that it might be OK to use SQLite for now:

- Looking at some performance data, it seems like SQLite may be just as performant for reads as RocksDB, but about 50x - 80x slower for writes. This is not something I ran myself - so, things may be very different for our use case.
- My thinking is that swapping out the database engine should be relatively simple before we go to mainnet, assuming we encapsulate the logic well. So, if in our benchmarking we find out that SQLite will get us in trouble down the road, we should be able to replace it with something else (and we'll already have a baseline to compare to).
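Encapsulating the engine could mean hiding it behind a narrow interface, so that callers never see SQL or engine-specific types. The sketch below does this for the nullifier map (nullifier -> block number) with one SQLite-backed and one in-memory implementation. The interface and class names are hypothetical.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class NullifierStore(ABC):
    """Engine-agnostic interface; the rest of the node depends only on this."""

    @abstractmethod
    def insert(self, nullifier: bytes, block_num: int) -> None: ...

    @abstractmethod
    def block_of(self, nullifier: bytes) -> Optional[int]: ...


class InMemoryNullifierStore(NullifierStore):
    def __init__(self):
        self.map = {}

    def insert(self, nullifier, block_num):
        if nullifier in self.map:
            raise ValueError("nullifier already recorded")
        self.map[nullifier] = block_num

    def block_of(self, nullifier):
        return self.map.get(nullifier)


class SqliteNullifierStore(NullifierStore):
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS nullifiers "
            "(nullifier BLOB PRIMARY KEY, block_num INTEGER NOT NULL)"
        )

    def insert(self, nullifier, block_num):
        try:
            with self.conn:  # one transaction per insert
                self.conn.execute(
                    "INSERT INTO nullifiers VALUES (?, ?)", (nullifier, block_num)
                )
        except sqlite3.IntegrityError:
            # Translate the engine-specific error so callers stay engine-agnostic.
            raise ValueError("nullifier already recorded") from None

    def block_of(self, nullifier):
        row = self.conn.execute(
            "SELECT block_num FROM nullifiers WHERE nullifier = ?", (nullifier,)
        ).fetchone()
        return row[0] if row else None
```

With this shape, benchmarking a second engine means adding one more subclass and running the same workload against both.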

> Typically the database becomes the bottleneck in blockchain systems so I think it's important to get this right - see this talk by the lead developer of Geth for more insight.

Agreed (and I love this talk)! That's one of the reasons I am hoping that we can keep most of Merkle data in memory. For account DB that shouldn't be a problem even in the long term - but for nullifier DB we might not be able to do it at high TPS (i.e., over 100 TPS).

frisitano May 29, 2023

> My thinking is that swapping out database engine should be relatively simple before we go to mainnet, assuming we encapsulate the logic well. So, if in our benchmarking we find out that SQLite will get us in trouble down the road, we should be able to replace it with something else (and we'll already have a baseline to compare to).

Yeah encapsulating the database makes sense to me. We should keep this in mind when implementing the db logic.

> Agreed (and I love this talk)! That's one of the reasons I am hoping that we can keep most of Merkle data in memory. For account DB that shouldn't be a problem even in the long term - but for nullifier DB we might not be able to do it at high TPS (i.e., over 100 TPS).

Agreed, it's a super insightful talk! Could you provide a short explanation of how the TSMT achieves superior memory efficiency compared to the MPT used in Ethereum when you get a chance, please? I understand the broad strokes but it would be good to see a comparison.

bobbinth May 29, 2023
Maintainer Author

> Could you provide a short explanation of how the TSMT achieves superior memory efficiency compared to the MPT used in Ethereum when you get a chance, please?

I'm actually not sure if it is more memory efficient (could be a bit less efficient, actually, as we prioritize efficiency within the VM quite a bit). I guess I'm assuming that the machines which will run full nodes will have a significant amount of RAM (like 32GB) and that dedicating 4GB - 8GB of it to the account Merkle tree would not be a big deal.

With this much RAM, we should be able to store the entire account tree in memory until about 100M accounts (this is just my high-level estimate - so, I could be a bit off). Beyond that, we could make optimizations where, for example, we store most upper levels (e.g., depth 0 - 26) in RAM, and then levels beyond that are stored on disk in sub-trees which occupy one page each. This way, we may be able to get away with something like 1 - 2 disk reads per Merkle branch even when number of accounts goes beyond 1 billion.
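A back-of-the-envelope check of these figures, under loudly stated assumptions: each node is a bare 32-byte digest with no per-node overhead, the tree is treated as a plain binary tree rather than the tiered layout, and a tree over n accounts is assumed to be compact (about 2n nodes). None of these numbers come from measurements.

```python
NODE_BYTES = 32  # assumed size of one node digest, ignoring any overhead
GIB = 2 ** 30


def full_levels_bytes(max_depth: int) -> int:
    """Bytes needed to hold every node at depths 0..max_depth of a binary tree.

    There are 2**d nodes at depth d, so 2**(max_depth + 1) - 1 nodes in total.
    """
    return (2 ** (max_depth + 1) - 1) * NODE_BYTES


def compact_tree_bytes(num_accounts: int) -> int:
    """Bytes for a compact binary Merkle tree: n leaves plus n - 1 inner nodes."""
    return (2 * num_accounts - 1) * NODE_BYTES


upper_levels = full_levels_bytes(26)          # depths 0 - 26 kept in RAM
whole_tree = compact_tree_bytes(100_000_000)  # 100M accounts

print(f"depths 0-26:   {upper_levels / GIB:.2f} GiB")
print(f"100M accounts: {whole_tree / GIB:.2f} GiB")
```

Under these assumptions, fully populating depths 0 - 26 takes just under 4 GiB and a compact 100M-account tree takes roughly 6 GiB, both inside the 4GB - 8GB budget mentioned above. Real memory use would be higher once node overhead and the actual tree layout are taken into account.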
