
add(state): Track spending transaction ids by spent outpoints and revealed nullifiers #8895

Open
wants to merge 26 commits into base: main
Conversation


@arya2 arya2 commented Sep 27, 2024

Motivation

We want to look up transaction ids by their transparent inputs and revealed nullifiers.

Closes #8837.
Closes #8838.
Closes #8922.

Solution

  • Adds a new tx_loc_by_spent_out_loc column family
  • Updates revealed nullifier column families to store spending transaction locations as values instead of ()
  • Stores the TransactionLocation of spending transactions by spent OutputLocations and nullifiers when writing blocks to the finalized state
  • Adds the hashes of spending transactions as the values in the spent_utxos field on non-finalized Chains
  • Adds ReadRequest and ReadResponse variants for querying spending tx ids by outpoint with the ReadStateService
  • Adds a spending_transaction_hash() read function used to handle the new ReadRequest
  • Updates snapshots

It may be possible to reuse the existing tx_loc_by_transparent_addr_loc column family instead, but adding a new one seemed easier.

Related Changes:

  • Updates cancel_receiver to a crossbeam-channel::mpmc and parallelizes the db format upgrade by block

Tests

  • Adds a test that checks the last 500 blocks in the finalized and non-finalized state for spending transaction ids (uses a cached state)
  • Manually tested the db format upgrade
  • Full Mainnet sync test running here

PR Author's Checklist

  • The PR name will make sense to users.
  • The PR provides a CHANGELOG summary.
  • The solution is tested.
  • The documentation is up to date.
  • The PR has a priority label.

PR Reviewer's Checklist

  • The PR Author's checklist is complete.
  • The PR resolves the issue.

@arya2 arya2 added C-enhancement Category: This is an improvement A-state Area: State / database changes P-Medium ⚡ labels Sep 27, 2024
@arya2 arya2 self-assigned this Sep 27, 2024
@arya2 arya2 changed the title add(state): Track spending transaction ids by spent outpoint add(state): Track spending transaction ids by spent outpoints and revealed nullifiers Oct 1, 2024
@arya2 arya2 marked this pull request as ready for review October 7, 2024 18:41
@arya2 arya2 requested a review from a team as a code owner October 7, 2024 18:41
@arya2 arya2 requested review from upbqdn and removed request for a team October 7, 2024 18:41
@arya2 arya2 added the do-not-merge Tells Mergify not to merge this PR label Oct 10, 2024

arya2 commented Oct 10, 2024

Added a do-not-merge label so that this won't be published until after #8922 has been implemented to avoid substantially increasing storage requirements for users that won't be using these indexes.

…a read method and an update to `prepare_spending_transparent_tx_ids_batch()` for maintaining it when committing blocks to the finalized state.

Adds TODOs for remaining production changes needed for issue #8837.
…aDb instead of DiskDb, checks cancel_receiver before every db operation
@arya2 arya2 force-pushed the index-tx-loc-by-spent-out-loc branch from fd2bcca to f6faf42 Compare October 21, 2024 20:05

arya2 commented Oct 25, 2024

@mpguerra It looks like it's actually not using much storage space. I was looking at the db metrics printed at startup, which were about double the expected storage requirements prior to the change, but the total size of the state cache directory is about the same as it was before, so I think the db metrics overestimate the total db size.

I checked the number of keys by column family: by height 2M on Mainnet, there are ~10M transparent outputs and ~150M nullifiers in total, not all of which are spent. At 10 bytes per spent transparent output and 5 bytes per nullifier, that should be, at most, ~1 GB of additional storage by block 2M. I'll update this comment with the number of nullifiers and transparent outputs at the network chain tip once my node finishes syncing, but it's looking like hiding this behind a feature may have been unnecessary.

Having the indexes behind a feature still seems nice to have, but there's also unnecessary complexity to review and maintain around adding/deleting the indexes. Should we keep them behind a feature in this PR or remove the feature?

Update:

At the current network chain tip, it's about 6.2 GB of extra data (5.5 GB + 140M * 5 B). Also, it's 14 bytes per spent transparent output, not 10 (I had forgotten about the output index).

6.2 GB doesn't seem excessive, but we could use the feature later if/when caching blocks in their compact format.

Relevant Column Family Sizes
sprout_nullifiers (Disk: 146.7 MB, Memory: 9.4 MB, num_keys: Some(1663236)) 
sapling_nullifiers (Disk: 230 MB, Memory: 4.2 MB, num_keys: Some(3068534)) 
orchard_nullifiers (Disk: 6.3 GB, Memory: 55.1 MB, num_keys: Some(134798732)) 
tx_loc_by_spent_out_loc (Disk: 5.5 GB, Memory: 6.3 MB, num_keys: Some(316786532)) 

Comment on lines +311 to +322
+| `tx_loc_by_spent_out_loc` | `OutputLocation` | `TransactionLocation` | Create |
 | *Sprout* | | | |
-| `sprout_nullifiers` | `sprout::Nullifier` | `()` | Create |
+| `sprout_nullifiers` | `sprout::Nullifier` | `TransactionLocation` | Create |
 | `sprout_anchors` | `sprout::tree::Root` | `sprout::NoteCommitmentTree` | Create |
 | `sprout_note_commitment_tree` | `()` | `sprout::NoteCommitmentTree` | Update |
 | *Sapling* | | | |
-| `sapling_nullifiers` | `sapling::Nullifier` | `()` | Create |
+| `sapling_nullifiers` | `sapling::Nullifier` | `TransactionLocation` | Create |
 | `sapling_anchors` | `sapling::tree::Root` | `()` | Create |
 | `sapling_note_commitment_tree` | `block::Height` | `sapling::NoteCommitmentTree` | Create |
 | `sapling_note_commitment_subtree` | `block::Height` | `NoteCommitmentSubtreeData` | Create |
 | *Orchard* | | | |
-| `orchard_nullifiers` | `orchard::Nullifier` | `()` | Create |
+| `orchard_nullifiers` | `orchard::Nullifier` | `TransactionLocation` | Create |

This will need an update if we keep the indexer feature.

…c when trying to open the db with that column family.
@@ -6168,6 +6168,7 @@ dependencies = [
"bincode",
"chrono",
"color-eyre",
"crossbeam-channel",
@arya2 arya2 Oct 25, 2024


The format upgrade runs much faster for me now, about 8 minutes for 2M blocks, which has been helpful for manual testing. I think the channel types in Rust's standard library didn't work because they weren't Send or Sync, and the ones in tokio lack blocking receive methods.


arya2 commented Oct 25, 2024

The scan_starts_where_left test is failing here: https://github.com/ZcashFoundation/zebra/actions/runs/11513520050/job/32050881919?pr=8895#step:15:615

I thought it was because a column family could be dropped earlier, but it happened again after switching to removing a comprehensive range instead of dropping the column family. So now I'm thinking it's because the test tries to open a secondary db before opening the primary db: opening the primary db creates any missing column families, but opening the secondary panics when one is missing because it lacks write access.

I'll confirm that manually. If that is the case, I think TrustedChainSync should just log a warning saying "Run Zebra first or downgrade your Zebra version". This PR should also either bump the db format version or hide the new column family behind the indexer feature; in the latter case, Zebra couldn't clear the column family once it's been populated, so I'm leaning towards bumping the db format version.
