feat(block-producer): instrument block building #675

Mirko-von-Leipzig · 2025-02-10T11:07:17Z

This PR reworks the instrumentation of the block builder to

align it more closely with open-telemetry standards, and
improve the telemetry and span information emitted

I also fix remote span context not being sent (propagator was not being set).

Implementation

I've leant heavily on #[instrument] because manually instrumenting sync and async function calls is painful and error prone. The former requires span.enter() and span.exit(), and the latter requires async_call.instrument(span).await. This is easy to forget.

The downside is that code must first be structured by function/stage because #[instrument] cannot be applied to code blocks { ... }. Though maybe this is an upside as well. I essentially structured the pipeline based on how I wanted the span information to be tracked.

Updated spans

refactored block building to better structure spans
store client to store.client.{method} though maybe this should be store.client/{method}
mempool to mempool.{method}

Visual aid

Visualized span result using tracing-forest (note: forest cannot trace across RPC boundaries so I've manually merged those in and marked them with *).

Note that I haven't really focussed on the store server aspect here, I'm on the fence about the naming convention to use. The standard mentions namespacing via periods aka x.y.z, but RPC routes should follow {package}.{service}/{method}. All this to say that the store client/server have several styles currently.

INFO     block_builder.build_block [ 1.62ms | 3.67% / 100.00% ]
INFO     ┝━ block_builder.select_block [ 28.6µs | 0.28% / 1.76% ]
INFO     │  ┝━ mempool.lock [ 957ns | 0.06% ]
INFO     │  ┕━ mempool.block.select [ 23.1µs | 1.43% ]
INFO     ┝━ block_builder.get_block_inputs [ 144µs | 0.23% / 8.89% ]
INFO     │  ┝━ block_builder.summarize_batches [ 958ns | 0.06% ]
INFO     │  ┕━ store.client.get_block_inputs [ 139µs | 8.60% ]
INFO     │     ┕━ * store.rpc/GetBlockInputs [ 181µs | 27.20% / 100.00% ]
INFO     │        * ┕━ store.server.get_block_inputs [ 132µs | 54.00% / 72.80% ]
INFO     │        *   ┝━ validate_nullifiers [ 958ns | 0.53% ]
INFO     │        *   ┝━ read_account_ids [ 375ns | 0.21% ]
INFO     │        *   ┝━ validate_notes [ 583ns | 0.32% ]
INFO     │        *   ┝━ select_block_header_by_block_num [ 11.5µs | 6.37% ]
INFO     │        *   ┝━ select_note_inclusion_proofs [ 12.3µs | 6.83% ]
INFO     │        *   ┕━ select_block_headers [ 8.21µs | 4.54% ]
INFO     ┝━ block_builder.prove_block [ 1.34ms | 25.23% / 82.92% ]
INFO     │  ┝━ execute_program [ 926µs | 57.10% ]
INFO     │  ┕━ block_builder.simulate_proving [ 9.50µs | 0.59% ]
INFO     ┝━ block_builder.inject_failure [ 3.38µs | 0.21% ]
INFO     ┕━ block_builder.commit_block [ 41.4µs | 0.13% / 2.56% ]
INFO        ┝━ store.client.apply_block [ 23.3µs | 1.44% ]
INFO        │   ┕━ * store.rpc/ApplyBlock [ 182µs | 24.24% / 100.00% ]
INFO        │      * ┕━ store.server.apply_block [ 138µs | 24.95% / 75.76% ]
INFO        │      *    ┕━ apply_block [ 92.5µs | 27.65% / 50.80% ]
INFO        │      *       ┝━ select_block_header_by_block_num [ 10.3µs | 5.65% ]
INFO        │      *       ┝━ update_in_memory_structs [ 31.9µs | 17.50% ]
INFO        │      *       ┕━ ｉ [info]: apply_block successful
INFO        ┝━ mempool.lock [ 250ns | 0.02% ]
INFO        ┕━ mempool.block.commit [ 15.8µs | 0.85% / 0.97% ]
INFO           ┕━ mempool.transactions.revert [ 2.00µs | 0.12% ]

Tidbits:

I included the lock time of the mempool so we can see contention. I think this will make an interesting stat to monitor.
I'm unsure about sub-namespacing the mempool i.e. mempool.block.xxx vs mempool.block_xxx.
The transactions revert in mempool is due to the tx expiring - I should maybe add a span there to make it obvious?

Mirko-von-Leipzig · 2025-02-10T11:13:32Z

crates/block-producer/src/mempool/mod.rs

+#[derive(Clone)]
+pub struct SharedMempool(Arc<Mutex<Mempool>>);
+
+impl SharedMempool {
+    #[instrument(target = COMPONENT, name = "mempool.lock", skip_all)]
+    pub async fn lock(&self) -> MutexGuard<'_, Mempool> {
+        self.0.lock().await
+    }
+}


Stronger typing so we could always have the mempool.lock span.

crates/block-producer/src/block_builder/mod.rs

Mirko-von-Leipzig · 2025-02-10T11:18:20Z

crates/block-producer/src/mempool/mod.rs

-
+    /// Panics if there is no block in flight.
+    #[instrument(target = COMPONENT, name = "mempool.block.rollback", skip_all)]
+    pub fn rollback_block(&mut self) {


No longer takes block_number as this made state tracking more painful -- since the error case in the block builder wouldn't have direct access to the block number. We can revert this change, but I don't think it makes a difference at this stage.

Cargo.toml

Mirko-von-Leipzig · 2025-02-10T12:17:52Z

Something else that's missing is a transaction count for the block. This data isn't directly accessible via ProvenBatch though we could maybe pull it in using the mempool information. I don't really like this idea though - feels like coupling.

@PhilippGackstatter is the transaction list/count something we are going to add to the proven batch? I recall there was some discussion.

PhilippGackstatter · 2025-02-10T12:30:38Z

@PhilippGackstatter is the transaction list/count something we are going to add to the proven batch? I recall there was some discussion.

Yeah, there is 0xPolygonMiden/miden-base#1121 which adds VerifiedTransaction which, would be added to ProvenBatch. Then we'd have the tx count. This seems to be planned vor v0.9. Would that be good enough or should we add the transaction count before that?

Mirko-von-Leipzig · 2025-02-10T13:56:25Z

@PhilippGackstatter is the transaction list/count something we are going to add to the proven batch? I recall there was some discussion.

Yeah, there is 0xPolygonMiden/miden-base#1121 which adds VerifiedTransaction which, would be added to ProvenBatch. Then we'd have the tx count. This seems to be planned vor v0.9. Would that be good enough or should we add the transaction count before that?

Its fine, we can wait on it. Its just nice pictures to look at.

Mirko-von-Leipzig · 2025-02-10T16:33:48Z

crates/block-producer/src/block_builder/mod.rs

+            .and_then(|proven_block| async { self.inject_failure(proven_block) })
+            .and_then(|proven_block| self.commit_block(mempool, proven_block))
+            // Handle errors by propagating the error to the root span and rolling back the block.
+            .inspect_err(|err| Span::current().set_status(Status::Error { description: format!("{err:?}").into() }))


This kind of sucks, but tracing doesn't give you an error report. One option is to wrap this error propagation here with anyhow just for error reporting.

However there is a deeper issue, and that is that #[instrument(err)] does not do error reporting either.

I'm unsure what the correct general solution is; possibly something to do with a custom trace layer/visitor/subscriber. I tried a few things but I'm giving up for now.

@PhilippGackstatter maybe you stumbled into something?

Yeah, this is a larger problem in the node and in the client for which I wanted to create issues but didn't have time for yet. Basically, format!("{error}"); is always wrong now, because it only displays the first error in the chain and effectively erases the (hopefully) useful source errors. (I guess in some cases we may want to erase details to not leak info, but that is already handled explicitly like in impl From<AddTransactionError> for tonic::Status).
So wrapping with anyhow is one solution, or a small custom error report utility function is another. The client has this problem on the Rust -> Wasm boundary, where one has to convert the Rust error types into JsValue and this is also done with Display.

In my previous job I was also working on Wasm bindings for a Rust library and we implemented our own small utility error report function, which is just a couple of lines. We don't really need backtraces, which is where anyhow would be useful, so I think we don't necessarily need to depend on anyhow here.

We can add this to miden-base, because, ideally, we also change this in the client and then we have a central place to format these reports. Wdyt?

The only issue is that the formatting solution doesn't work for #[instrument(err)] because it just uses the display impl of the error. We would have to experiment with injecting the formatter into tracing itself - then it would be a good solution.

The more robust approach would be to create a new attribute macro to replace thiserror which adds the error report as the display impl. If we remove thiserror then this neatly solves this problem everywhere.

I'm actually somewhat leaning towards the macro option tbh.

I don't have much experience with tracing, but I'm hoping there's a way to inject a formatter, maybe with https://docs.rs/tracing-error/latest/tracing_error/.

I think I saw an issue in thiserror repo asking if format!("{err:#}") could produce a report, but I don't think that was implemented. Not sure if we should replace thiserror with our own macro though. It would be the exception to the rule in the Rust ecosystem and if we do ever include a "foreign" crate that uses regular thiserror, then we'd have to deal with that exception too. This may already be the case with some dependencies in miden-base or miden-client. So to me the most universal solution seems to be the custom error reporter.

Some more digging and injecting a custom error reporter is not possible for tracing at the moment.

iiuc tracing_error injects span information into a new error type - and not the other way around. So its a way of creating a new error that has a form of backtrace via spans.

Relevant issues and PRs

Syntax for recording errors in events and spans tokio-rs/tracing#3067

attributes: support recording error as std::error::Error tokio-rs/tracing#3057

#[tracing::instrument(err)] doesn't work with opentelemetry / datadog tokio-rs/tracing#2648

Summary

The issue is two-fold:

tracing supports dyn Error as a specialisation i.e. we could customise the error report there. However this requires opting in - as in, you need manually cast: tracing::event!(err as &dyn Error) to get that specialisation to kick in.

#[instrument(err)] does not perform this casting and there is no way to opt-in; so by default you always get Display (%) or opt-into Debug (?).

I think the open-telemetry integration with tracing-opentelemetry re-uses the string captured by tracing so we get the same outcome there.

Thanks for digging into it, that's an unfortunate outcome.
Maybe the Debug representation would be better then, even if less nice? At least we'd get the full error chain that way. Not sure there's much we can do, this is a tracing issue afaict.

Replacing thiserror in all our crates with a macro that generates an error report would have side effects unless we rollback our "source instead of Display" approach. If we make the Display impl of all errors an error report then we'd have duplicate error messages.

We also can't make our thiserror replacement only print an error report when core::fmt::Formatter::alternate is set because, based on your findings, we probably can't tell tracing to print the error as "{error:#}" either :(

The client has this problem on the Rust -> Wasm boundary, where one has to convert the Rust error types into JsValue and this is also done with Display.

Yeah, I've also kept this in the back of my mind since making the errors more consistent. In the CLI and library errors are surfaced and reported correctly, but for the web client crate there should be a utility that converts errors to String by reporting sources correctly.

for the web client crate there should be a utility that converts errors to String by reporting sources correctly.

Nice! Yeah, what works pretty well is to add something like a WasmResult trait which can be implemented for all errors that should be converted to JsValue and then you can just use ? for the conversion, which is pretty convenient. For inspiration, take a look at this file: https://github.com/iotaledger/identity.rs/blob/31d9f81c0cce18a5917c4677391d7ff46a0f3de2/bindings/wasm/src/error.rs.

Mirko-von-Leipzig · 2025-02-10T17:51:50Z

Trace of a successful block build:

and then one with an error injected (note that both the original span, and the root span are marked as errors):

bobbinth

Looks good! Thank you! I left some comments inline - the main ones are about naming convention.

crates/block-producer/src/mempool/mod.rs

crates/block-producer/src/block_builder/mod.rs

PhilippGackstatter

Looks good to me!

crates/block-producer/src/block_builder/mod.rs

PhilippGackstatter · 2025-02-11T12:42:24Z

crates/block-producer/src/block_builder/mod.rs

+            .and_then(|proven_block| async { self.inject_failure(proven_block) })
+            .and_then(|proven_block| self.commit_block(mempool, proven_block))
+            // Handle errors by propagating the error to the root span and rolling back the block.
+            .inspect_err(|err| Span::current().set_status(Status::Error { description: format!("{err:?}").into() }))


Thanks for digging into it, that's an unfortunate outcome.
Maybe the Debug representation would be better then, even if less nice? At least we'd get the full error chain that way. Not sure there's much we can do, this is a tracing issue afaict.

Replacing thiserror in all our crates with a macro that generates an error report would have side effects unless we rollback our "source instead of Display" approach. If we make the Display impl of all errors an error report then we'd have duplicate error messages.

We also can't make our thiserror replacement only print an error report when core::fmt::Formatter::alternate is set because, based on your findings, we probably can't tell tracing to print the error as "{error:#}" either :(

Restrain otel crates to utils by re-exporting

igamigo

LGTM. Screenshots look great!

crates/block-producer/src/mempool/mod.rs

Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch from cd7e6e2 to 9f9faa5 Compare February 10, 2025 11:08

Mirko-von-Leipzig commented Feb 10, 2025

View reviewed changes

crates/block-producer/src/block_builder/mod.rs Outdated Show resolved Hide resolved

Mirko-von-Leipzig commented Feb 10, 2025

View reviewed changes

Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch 2 times, most recently from 1ecdba1 to b917e80 Compare February 10, 2025 11:26

Instrument block builder

c984aa7

Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch from b917e80 to c984aa7 Compare February 10, 2025 11:28

Mirko-von-Leipzig marked this pull request as ready for review February 10, 2025 11:31

Mirko-von-Leipzig commented Feb 10, 2025

View reviewed changes

Cargo.toml Outdated Show resolved Hide resolved

Fix remote context spans

7edd355

Mirko-von-Leipzig commented Feb 10, 2025

View reviewed changes

bobbinth requested review from bobbinth and igamigo February 10, 2025 17:28

bobbinth approved these changes Feb 11, 2025

View reviewed changes

Address review comments

e126233

Mirko-von-Leipzig requested a review from PhilippGackstatter February 11, 2025 10:57

This was referenced Feb 11, 2025

Telemetry tracking issue #677

Open

Instrument batch building #679

Open

Custom error reporting in traces #683

Open

PhilippGackstatter approved these changes Feb 11, 2025

View reviewed changes

Mirko-von-Leipzig added 2 commits February 11, 2025 16:10

Restrain otel crates to utils by re-exporting

457881c

More review comments

cac66df

Restrain otel crates to utils by re-exporting

Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch from 3fe2acb to cac66df Compare February 11, 2025 14:15

igamigo approved these changes Feb 11, 2025

View reviewed changes

crates/block-producer/src/mempool/mod.rs Show resolved Hide resolved

Mirko-von-Leipzig merged commit ebef215 into next Feb 11, 2025
11 checks passed

Mirko-von-Leipzig deleted the mirko/instrument_block_builder branch February 11, 2025 17:14

This was referenced Feb 11, 2025

Expose metrics for each service #144

Closed

Add healthchecking #145

Closed

Add custom OpenTelemetrySpanExt trait #700

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(block-producer): instrument block building #675

feat(block-producer): instrument block building #675

Mirko-von-Leipzig commented Feb 10, 2025 •

edited

Loading

Mirko-von-Leipzig Feb 10, 2025

Mirko-von-Leipzig Feb 10, 2025

Mirko-von-Leipzig commented Feb 10, 2025 •

edited

Loading

PhilippGackstatter commented Feb 10, 2025

Mirko-von-Leipzig commented Feb 10, 2025

Mirko-von-Leipzig Feb 10, 2025

PhilippGackstatter Feb 11, 2025

Mirko-von-Leipzig Feb 11, 2025

PhilippGackstatter Feb 11, 2025

Mirko-von-Leipzig Feb 11, 2025 •

edited

Loading

PhilippGackstatter Feb 11, 2025

igamigo Feb 11, 2025

PhilippGackstatter Feb 11, 2025

Mirko-von-Leipzig commented Feb 10, 2025

bobbinth left a comment

PhilippGackstatter left a comment

PhilippGackstatter Feb 11, 2025

igamigo left a comment

feat(block-producer): instrument block building #675

feat(block-producer): instrument block building #675

Conversation

Mirko-von-Leipzig commented Feb 10, 2025 • edited Loading

Implementation

Updated spans

Visual aid

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mirko-von-Leipzig commented Feb 10, 2025 • edited Loading

PhilippGackstatter commented Feb 10, 2025

Mirko-von-Leipzig commented Feb 10, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mirko-von-Leipzig Feb 11, 2025 • edited Loading

Choose a reason for hiding this comment

Relevant issues and PRs

Summary

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mirko-von-Leipzig commented Feb 10, 2025

bobbinth left a comment

Choose a reason for hiding this comment

PhilippGackstatter left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

igamigo left a comment

Choose a reason for hiding this comment

Mirko-von-Leipzig commented Feb 10, 2025 •

edited

Loading

Mirko-von-Leipzig commented Feb 10, 2025 •

edited

Loading

Mirko-von-Leipzig Feb 11, 2025 •

edited

Loading