
feat: add chunk application stats #12797

Open · wants to merge 13 commits into master

Conversation

@jancionear (Contributor) commented Jan 24, 2025

This is the first step towards per-chunk metrics (#12758).

This PR adds a new struct, ChunkApplyStats, which records what happened during chunk application: for example, how many transactions and receipts were processed, what the outgoing limits were, and how many receipts were forwarded or buffered.

For now ChunkApplyStats contains mainly data relevant to the bandwidth scheduler; in the future more stats can be added to measure other things we're interested in. I didn't want to add too much at once, to keep the PR size reasonable.

There was already a struct called ApplyStats, but it was used only for the balance checker. I
replaced it with BalanceStats inside ChunkApplyStats.
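
For reference, a minimal sketch of the new types, with field names taken from the example output below (the exact definitions in the PR may differ):

use borsh::{BorshDeserialize, BorshSerialize};

// Sketch only: field names inferred from the example output below.
// ReceiptSinkStats and BandwidthSchedulerStats appear in that output too.
#[derive(Debug, Clone, BorshSerialize, BorshDeserialize)]
pub struct ChunkApplyStatsV0 {
    pub height: u64,
    pub shard_id: u64,
    pub is_chunk_missing: bool,
    pub transactions_num: u64,
    pub incoming_receipts_num: u64,
    pub receipt_sink: ReceiptSinkStats,
    pub bandwidth_scheduler: BandwidthSchedulerStats,
    pub balance: BalanceStats,
}

// Replaces the old ApplyStats, which served only the balance checker.
#[derive(Debug, Clone, BorshSerialize, BorshDeserialize)]
pub struct BalanceStats {
    pub tx_burnt_amount: u128,
    pub slashed_burnt_amount: u128,
    pub other_burnt_amount: u128,
    pub gas_deficit_amount: u128,
}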

ChunkApplyStats are returned in ApplyChunkResult and saved to the database for later use. A new
database column is added to keep the chunk application stats. The column is included in the standard
garbage collection logic to keep the size of saved data reasonable.
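
As a rough sketch of the save path (the method name below is an assumption; the argument list matches the diff excerpt quoted later in this conversation):

// Hypothetical save site inside block postprocessing. Only the arguments
// are confirmed by the diff excerpt below; `save_chunk_apply_stats` is an
// assumed name for the ChainStoreUpdate method.
chain_store_update.save_chunk_apply_stats(
    *block_hash,
    shard_uid.shard_id(),
    apply_result.stats,
);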

Running neard view-state chunk-apply-stats allows a node operator to view the chunk application stats for a given chunk. Example output for a mainnet chunk:

$ ./neard view-state chunk-apply-stats --block-hash GKzyP7DVNw5ctUcBhRRkABMaC2giNSKK5oHCrRc9hnXH --shard-id 0
...
V0(
    ChunkApplyStatsV0 {
        height: 138121896,
        shard_id: 0,
        is_chunk_missing: false,
        transactions_num: 35,
        incoming_receipts_num: 103,
        receipt_sink: ReceiptSinkStats {
            outgoing_limits: {
                0: OutgoingLimitStats {
                    size: 102400,
                    gas: 18446744073709551615,
                },
                1: OutgoingLimitStats {
                    size: 4718592,
                    gas: 300000000000000000,
                },
                2: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
                3: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
                4: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
                5: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
            },
            forwarded_receipts: {
                0: ReceiptsStats {
                    num: 24,
                    total_size: 6801,
                    total_gas: 515985143008901,
                },
                2: ReceiptsStats {
                    num: 21,
                    total_size: 6962,
                    total_gas: 639171080456467,
                },
                3: ReceiptsStats {
                    num: 58,
                    total_size: 17843,
                    total_gas: 1213382619794847,
                },
                4: ReceiptsStats {
                    num: 20,
                    total_size: 6278,
                    total_gas: 235098003759589,
                },
                5: ReceiptsStats {
                    num: 4,
                    total_size: 2089,
                    total_gas: 245101556851946,
                },
            },
            buffered_receipts: {},
            final_outgoing_buffers: {
                0: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                2: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                3: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                4: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                5: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
            },
            is_outgoing_metadata_ready: {
                0: false,
                2: false,
                3: false,
                4: false,
                5: false,
            },
            all_outgoing_metadatas_ready: false,
        },
        bandwidth_scheduler: BandwidthSchedulerStats {
            params: None,
            prev_bandwidth_requests: {},
            prev_bandwidth_requests_num: 0,
            time_to_run_ms: 0,
            granted_bandwidth: {},
            new_bandwidth_requests: {},
        },
        balance: BalanceStats {
            tx_burnt_amount: 4115983319195000000000,
            slashed_burnt_amount: 0,
            other_burnt_amount: 0,
            gas_deficit_amount: 0,
        },
    },
)

The stats are also available in ChainStore, making it easy to read them from tests.
In the future we could also add an RPC endpoint to make the stats available in debug-ui.
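
A sketch of what a test-side read could look like, assuming a ChainStore accessor named get_chunk_apply_stats (the name and return type are illustrative, not confirmed by this PR):

// Illustrative only: accessor name and Option-wrapping are assumptions.
let stats = chain_store
    .get_chunk_apply_stats(&block_hash, shard_id)?
    .expect("stats should be saved for an applied chunk");
let ChunkApplyStats::V0(stats) = stats;
assert_eq!(stats.transactions_num, expected_tx_count);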

The PR is divided into commits for easier review.

@jancionear jancionear requested a review from wacban January 24, 2025 18:38
@jancionear jancionear requested a review from a team as a code owner January 24, 2025 18:38
@jancionear jancionear requested a review from mooori January 24, 2025 18:39
@jancionear (Contributor Author)

/cc @mooori @nagisa
We could add more stats to ChunkApplyStats to help analyze runtime performance: where the gas and time are spent, which limits were hit, etc.

*block_hash,
shard_uid.shard_id(),
apply_result.stats,
);
@jancionear (Contributor Author) · Jan 24, 2025

Saving chunk stats here means that only chunks applied inside blocks will have their stats saved. Stateless chunk validators will not save any stats. In the future we could save them somewhere else as well, but this is good enough for the first version.

@@ -462,7 +467,8 @@ impl DBCol {
            | DBCol::StateHeaders
            | DBCol::TransactionResultForBlock
            | DBCol::Transactions
-           | DBCol::StateShardUIdMapping => true,
+           | DBCol::StateShardUIdMapping
+           | DBCol::ChunkApplyStats => true,
@jancionear (Contributor Author)

I hope that marking this column as cold is enough to avoid garbage collection on archival nodes? I think these stats should be kept forever on archival nodes. They are not that big and it would be nice to be able to view stats for chunks older than three epochs.

/// The stats can be read to analyze what happened during chunk application.
/// - *Rows*: BlockShardId (BlockHash || ShardId) - 40 bytes
/// - *Column type*: `ChunkApplyStats`
ChunkApplyStats,
@jancionear (Contributor Author)

At first I thought that I could use ChunkHash as a key in the database, but that doesn't really
work. The same chunk can be applied multiple times when there are missing chunks, and I think chunks
created using the same prev_block would have the same hash (?).
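
For concreteness, the 40-byte BlockShardId key from the column docs is just the block hash followed by the shard id. A minimal sketch (nearcore has an existing helper for this; the byte order shown is an assumption):

// BlockShardId = BlockHash (32 bytes) || ShardId (8 bytes) = 40 bytes.
fn block_shard_id_key(block_hash: &[u8; 32], shard_id: u64) -> [u8; 40] {
    let mut key = [0u8; 40];
    key[..32].copy_from_slice(block_hash);
    key[32..].copy_from_slice(&shard_id.to_le_bytes()); // endianness assumed
    key
}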

@@ -648,6 +648,7 @@ impl<'a> ChainStoreUpdate<'a> {
        self.gc_outgoing_receipts(&block_hash, shard_id);
        self.gc_col(DBCol::IncomingReceipts, &block_shard_id);
        self.gc_col(DBCol::StateTransitionData, &block_shard_id);
+       self.gc_col(DBCol::ChunkApplyStats, &block_shard_id);
@jancionear (Contributor Author)

I wonder if we could use some other garbage collection logic to keep the stats for longer than three epochs. Maybe something similar to LatestWitnesses where the last N witnesses are kept in the database? It's annoying that useful data like these stats disappears after three epochs, especially in tests which have to run for a few epochs. Can be changed later.

@@ -336,7 +327,7 @@ impl Runtime {
        apply_state: &ApplyState,
        signed_transaction: &SignedTransaction,
        transaction_cost: &TransactionCost,
-       stats: &mut ApplyStats,
+       stats: &mut ChunkApplyStatsV0,
    ) -> Result<(Receipt, ExecutionOutcomeWithId), InvalidTxError> {
        let span = tracing::Span::current();
        metrics::TRANSACTION_PROCESSED_TOTAL.inc();
@jancionear (Contributor Author)

Runtime metrics could probably be refactored so that we first collect the stats and then record all of them in the metrics at the very end. That would reduce clutter in the runtime code.
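
Roughly this collect-then-report shape (the function and the second counter are hypothetical):

// Sketch: the runtime only mutates ChunkApplyStatsV0 while applying the
// chunk, and all Prometheus counters are bumped once at the end.
fn report_chunk_apply_metrics(stats: &ChunkApplyStatsV0) {
    metrics::TRANSACTION_PROCESSED_TOTAL.inc_by(stats.transactions_num);
    // INCOMING_RECEIPTS_TOTAL is a hypothetical counter, shown for
    // illustration; today the runtime calls inc() at each site instead.
    metrics::INCOMING_RECEIPTS_TOTAL.inc_by(stats.incoming_receipts_num);
}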

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 71.79487% with 77 lines in your changes missing coverage. Please review.

Project coverage is 70.53%. Comparing base (6f11ae3) to head (ccd5e01).
Report is 4 commits behind head on master.

Files with missing lines                               Patch %   Lines
tools/state-viewer/src/commands.rs                       0.00%   18 Missing ⚠️
chain/chain/src/store/mod.rs                            36.00%   16 Missing ⚠️
runtime/runtime/src/lib.rs                              57.69%   7 Missing and 4 partials ⚠️
core/store/src/adapter/chain_store.rs                    0.00%   9 Missing ⚠️
core/primitives/src/chunk_apply_stats.rs                92.39%   7 Missing ⚠️
runtime/runtime/src/congestion_control.rs               87.71%   7 Missing ⚠️
tools/state-viewer/src/cli.rs                            0.00%   6 Missing ⚠️
.../runtime-params-estimator/src/estimator_context.rs    0.00%   2 Missing ⚠️
core/primitives/src/bandwidth_scheduler.rs               0.00%   1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #12797    +/-   ##
========================================
  Coverage   70.53%   70.53%            
========================================
  Files         846      847     +1     
  Lines      174927   175254   +327     
  Branches   174927   175254   +327     
========================================
+ Hits       123389   123623   +234     
- Misses      46285    46376    +91     
- Partials     5253     5255     +2     
Flag                     Coverage Δ
backward-compatibility    0.16% <0.00%> (-0.01%) ⬇️
db-migration              0.16% <0.00%> (-0.01%) ⬇️
genesis-check             1.40% <0.00%> (+0.05%) ⬆️
linux                    70.11% <71.79%> (+0.98%) ⬆️
linux-nightly            70.17% <71.79%> (+0.02%) ⬆️
pytests                   1.70% <0.00%> (+0.05%) ⬆️
sanity-checks             1.51% <0.00%> (+0.05%) ⬆️
unittests                70.37% <71.79%> (+<0.01%) ⬆️
upgradability             0.20% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@@ -0,0 +1,218 @@
use std::collections::BTreeMap;
Collaborator

Does this need to be a part of primitives? Isn't there an obvious conceptual "producer" crate which all dependents use that could hold this type?

@jancionear (Contributor Author)

I initially put it in node-runtime, but then I needed the struct in near-store, which doesn't depend on node-runtime, so I moved the struct to primitives. It's a primitive struct that is used in multiple crates, so that seemed like a good fit.

In the future there might be more crates that make use of these stats, maybe a custom aggregator which downloads stats from multiple nodes and aggregates them somehow. It would be nice to have a small crate that the aggregator can import without importing all of runtime.

If there's a better place for it please let me know.

/// Useful for debugging, metrics and sanity checks.
#[derive(Debug, Clone, BorshSerialize, BorshDeserialize)]
pub enum ChunkApplyStats {
    V0(ChunkApplyStatsV0),
Collaborator

Would it be possible for us to find a way to avoid versioning headaches with this mostly internal data? I don't think it would be painful if we made the old data inaccessible when the schema changes; we should take advantage of that.

@jancionear (Contributor Author)

These stats might be consumed by other services in the future (debug-ui, custom stats aggregators, etc.), so I wanted to have a (mostly) stable interface that they could depend on. My first thought was to make it versioned, but maybe there are other ways to go about it.
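
The versioned enum keeps old rows readable: Borsh writes the variant tag first, so a new variant can be appended without breaking data already stored as V0. A sketch (V1 is hypothetical):

use borsh::{BorshDeserialize, BorshSerialize};

#[derive(Debug, Clone, BorshSerialize, BorshDeserialize)]
pub enum ChunkApplyStats {
    V0(ChunkApplyStatsV0),
    // A future V1(ChunkApplyStatsV1) can be appended here; rows stored as
    // V0 keep deserializing because the leading variant tag is unchanged.
}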

@@ -1454,7 +1475,7 @@ pub struct StateStats {
    // The account that is in the middle of the state in respect to storage.
    pub middle_account: Option<StateStatsAccount>,
    // The total size of all accounts leading to the middle account.
-   // Can be used to determin how does the middle account split the state.
+   // Can be used to determine how does the middle account split the state.
    pub middle_account_leading_size: Option<ByteSize>,

    pub top_accounts: BinaryHeap<StateStatsAccount>,
@jancionear (Contributor Author) · Jan 27, 2025

I started a mainnet node with this branch to test it out, but sadly the node ran out of disk space and crashed :/
I ran ./neard database analyse-data-size-distribution and the total size of the columns was only 250 GB; ChunkApplyStats took up only 40 MB.
But the sstable files somehow add up to 2 TB of data :/ Debugging...
