Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(block-producer): instrument block building #675

Merged
merged 5 commits into from
Feb 11, 2025

Conversation

Mirko-von-Leipzig
Copy link
Contributor

@Mirko-von-Leipzig Mirko-von-Leipzig commented Feb 10, 2025

This PR reworks the instrumentation of the block builder to

  1. align it more closely with open-telemetry standards, and
  2. improve the telemetry and span information emitted

I also fix remote span context not being sent (propagator was not being set).

Implementation

I've leant heavily on #[instrument] because manually instrumenting sync and async function calls is painful and error prone. The former requires span.enter() and span.exit(), and the latter requires async_call.instrument(span).await. This is easy to forget.

The downside is that code must first be structured by function/stage because #[instrument] cannot be applied to code blocks { ... }. Though maybe this is an upside as well. I essentially structured the pipeline based on how I wanted the span information to be tracked.

Updated spans

  • refactored block building to better structure spans
  • store client to store.client.{method} though maybe this should be store.client/{method}
  • mempool to mempool.{method}

Visual aid

Visualized span result using tracing-forest (note: forest cannot trace across RPC boundaries so I've manually merged those in and marked them with *).

Note that I haven't really focussed on the store server aspect here, I'm on the fence about the naming convention to use. The standard mentions namespacing via periods aka x.y.z, but RPC routes should follow {package}.{service}/{method}. All this to say that the store client/server have several styles currently.

INFO     block_builder.build_block [ 1.62ms | 3.67% / 100.00% ]
INFO     ┝━ block_builder.select_block [ 28.6µs | 0.28% / 1.76% ]
INFO     │  ┝━ mempool.lock [ 957ns | 0.06% ]
INFO     │  ┕━ mempool.block.select [ 23.1µs | 1.43% ]
INFO     ┝━ block_builder.get_block_inputs [ 144µs | 0.23% / 8.89% ]
INFO     │  ┝━ block_builder.summarize_batches [ 958ns | 0.06% ]
INFO     │  ┕━ store.client.get_block_inputs [ 139µs | 8.60% ]
INFO     │     ┕━ * store.rpc/GetBlockInputs [ 181µs | 27.20% / 100.00% ]
INFO     │        * ┕━ store.server.get_block_inputs [ 132µs | 54.00% / 72.80% ]
INFO     │        *   ┝━ validate_nullifiers [ 958ns | 0.53% ]
INFO     │        *   ┝━ read_account_ids [ 375ns | 0.21% ]
INFO     │        *   ┝━ validate_notes [ 583ns | 0.32% ]
INFO     │        *   ┝━ select_block_header_by_block_num [ 11.5µs | 6.37% ]
INFO     │        *   ┝━ select_note_inclusion_proofs [ 12.3µs | 6.83% ]
INFO     │        *   ┕━ select_block_headers [ 8.21µs | 4.54% ]
INFO     ┝━ block_builder.prove_block [ 1.34ms | 25.23% / 82.92% ]
INFO     │  ┝━ execute_program [ 926µs | 57.10% ]
INFO     │  ┕━ block_builder.simulate_proving [ 9.50µs | 0.59% ]
INFO     ┝━ block_builder.inject_failure [ 3.38µs | 0.21% ]
INFO     ┕━ block_builder.commit_block [ 41.4µs | 0.13% / 2.56% ]
INFO        ┝━ store.client.apply_block [ 23.3µs | 1.44% ]
INFO        │   ┕━ * store.rpc/ApplyBlock [ 182µs | 24.24% / 100.00% ]
INFO        │      * ┕━ store.server.apply_block [ 138µs | 24.95% / 75.76% ]
INFO        │      *    ┕━ apply_block [ 92.5µs | 27.65% / 50.80% ]
INFO        │      *       ┝━ select_block_header_by_block_num [ 10.3µs | 5.65% ]
INFO        │      *       ┝━ update_in_memory_structs [ 31.9µs | 17.50% ]
INFO        │      *       ┕━ i [info]: apply_block successful
INFO        ┝━ mempool.lock [ 250ns | 0.02% ]
INFO        ┕━ mempool.block.commit [ 15.8µs | 0.85% / 0.97% ]
INFO           ┕━ mempool.transactions.revert [ 2.00µs | 0.12% ]

Tidbits:

  • I included the lock time of the mempool so we can see contention. I think this will make an interesting stat to monitor.
  • I'm unsure about sub-namespacing the mempool i.e. mempool.block.xxx vs mempool.block_xxx.
  • The transactions revert in mempool is due to the tx expiring - I should maybe add a span there to make it obvious?

@Mirko-von-Leipzig Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch from cd7e6e2 to 9f9faa5 Compare February 10, 2025 11:08
Comment on lines +130 to +138
#[derive(Clone)]
pub struct SharedMempool(Arc<Mutex<Mempool>>);

impl SharedMempool {
#[instrument(target = COMPONENT, name = "mempool.lock", skip_all)]
pub async fn lock(&self) -> MutexGuard<'_, Mempool> {
self.0.lock().await
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stronger typing so we could always have the mempool.lock span.


/// Panics if there is no block in flight.
#[instrument(target = COMPONENT, name = "mempool.block.rollback", skip_all)]
pub fn rollback_block(&mut self) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer takes block_number as this made state tracking more painful -- since the error case in the block builder wouldn't have direct access to the block number. We can revert this change, but I don't think it makes a difference at this stage.

@Mirko-von-Leipzig Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch 2 times, most recently from 1ecdba1 to b917e80 Compare February 10, 2025 11:26
@Mirko-von-Leipzig Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch from b917e80 to c984aa7 Compare February 10, 2025 11:28
@Mirko-von-Leipzig Mirko-von-Leipzig marked this pull request as ready for review February 10, 2025 11:31
@Mirko-von-Leipzig
Copy link
Contributor Author

Mirko-von-Leipzig commented Feb 10, 2025

Something else that's missing is a transaction count for the block. This data isn't directly accessible via ProvenBatch though we could maybe pull it in using the mempool information. I don't really like this idea though - feels like coupling.

@PhilippGackstatter is the transaction list/count something we are going to add to the proven batch? I recall there was some discussion.

@PhilippGackstatter
Copy link

@PhilippGackstatter is the transaction list/count something we are going to add to the proven batch? I recall there was some discussion.

Yeah, there is 0xPolygonMiden/miden-base#1121 which adds VerifiedTransaction which, would be added to ProvenBatch. Then we'd have the tx count. This seems to be planned vor v0.9. Would that be good enough or should we add the transaction count before that?

@Mirko-von-Leipzig
Copy link
Contributor Author

@PhilippGackstatter is the transaction list/count something we are going to add to the proven batch? I recall there was some discussion.

Yeah, there is 0xPolygonMiden/miden-base#1121 which adds VerifiedTransaction which, would be added to ProvenBatch. Then we'd have the tx count. This seems to be planned vor v0.9. Would that be good enough or should we add the transaction count before that?

Its fine, we can wait on it. Its just nice pictures to look at.

.and_then(|proven_block| async { self.inject_failure(proven_block) })
.and_then(|proven_block| self.commit_block(mempool, proven_block))
// Handle errors by propagating the error to the root span and rolling back the block.
.inspect_err(|err| Span::current().set_status(Status::Error { description: format!("{err:?}").into() }))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of sucks, but tracing doesn't give you an error report. One option is to wrap this error propagation here with anyhow just for error reporting.

However there is a deeper issue, and that is that #[instrument(err)] does not do error reporting either.

I'm unsure what the correct general solution is; possibly something to do with a custom trace layer/visitor/subscriber. I tried a few things but I'm giving up for now.

@PhilippGackstatter maybe you stumbled into something?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a larger problem in the node and in the client for which I wanted to create issues but didn't have time for yet. Basically, format!("{error}"); is always wrong now, because it only displays the first error in the chain and effectively erases the (hopefully) useful source errors. (I guess in some cases we may want to erase details to not leak info, but that is already handled explicitly like in impl From<AddTransactionError> for tonic::Status).
So wrapping with anyhow is one solution, or a small custom error report utility function is another. The client has this problem on the Rust -> Wasm boundary, where one has to convert the Rust error types into JsValue and this is also done with Display.

In my previous job I was also working on Wasm bindings for a Rust library and we implemented our own small utility error report function, which is just a couple of lines. We don't really need backtraces, which is where anyhow would be useful, so I think we don't necessarily need to depend on anyhow here.

We can add this to miden-base, because, ideally, we also change this in the client and then we have a central place to format these reports. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only issue is that the formatting solution doesn't work for #[instrument(err)] because it just uses the display impl of the error. We would have to experiment with injecting the formatter into tracing itself - then it would be a good solution.

The more robust approach would be to create a new attribute macro to replace thiserror which adds the error report as the display impl. If we remove thiserror then this neatly solves this problem everywhere.

I'm actually somewhat leaning towards the macro option tbh.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much experience with tracing, but I'm hoping there's a way to inject a formatter, maybe with https://docs.rs/tracing-error/latest/tracing_error/.

I think I saw an issue in thiserror repo asking if format!("{err:#}") could produce a report, but I don't think that was implemented. Not sure if we should replace thiserror with our own macro though. It would be the exception to the rule in the Rust ecosystem and if we do ever include a "foreign" crate that uses regular thiserror, then we'd have to deal with that exception too. This may already be the case with some dependencies in miden-base or miden-client. So to me the most universal solution seems to be the custom error reporter.

Copy link
Contributor Author

@Mirko-von-Leipzig Mirko-von-Leipzig Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more digging and injecting a custom error reporter is not possible for tracing at the moment.

iiuc tracing_error injects span information into a new error type - and not the other way around. So its a way of creating a new error that has a form of backtrace via spans.

Relevant issues and PRs

Summary

The issue is two-fold:

  1. tracing supports dyn Error as a specialisation i.e. we could customise the error report there. However this requires opting in - as in, you need manually cast: tracing::event!(err as &dyn Error) to get that specialisation to kick in.
  2. #[instrument(err)] does not perform this casting and there is no way to opt-in; so by default you always get Display (%) or opt-into Debug (?).

I think the open-telemetry integration with tracing-opentelemetry re-uses the string captured by tracing so we get the same outcome there.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into it, that's an unfortunate outcome.
Maybe the Debug representation would be better then, even if less nice? At least we'd get the full error chain that way. Not sure there's much we can do, this is a tracing issue afaict.

Replacing thiserror in all our crates with a macro that generates an error report would have side effects unless we rollback our "source instead of Display" approach. If we make the Display impl of all errors an error report then we'd have duplicate error messages.

We also can't make our thiserror replacement only print an error report when core::fmt::Formatter::alternate is set because, based on your findings, we probably can't tell tracing to print the error as "{error:#}" either :(

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client has this problem on the Rust -> Wasm boundary, where one has to convert the Rust error types into JsValue and this is also done with Display.

Yeah, I've also kept this in the back of my mind since making the errors more consistent. In the CLI and library errors are surfaced and reported correctly, but for the web client crate there should be a utility that converts errors to String by reporting sources correctly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the web client crate there should be a utility that converts errors to String by reporting sources correctly.

Nice! Yeah, what works pretty well is to add something like a WasmResult trait which can be implemented for all errors that should be converted to JsValue and then you can just use ? for the conversion, which is pretty convenient. For inspiration, take a look at this file: https://github.com/iotaledger/identity.rs/blob/31d9f81c0cce18a5917c4677391d7ff46a0f3de2/bindings/wasm/src/error.rs.

@bobbinth bobbinth requested review from bobbinth and igamigo February 10, 2025 17:28
@Mirko-von-Leipzig
Copy link
Contributor Author

Trace of a successful block build:

image

and then one with an error injected (note that both the original span, and the root span are marked as errors):

image

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left some comments inline - the main ones are about naming convention.

Copy link

@PhilippGackstatter PhilippGackstatter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

.and_then(|proven_block| async { self.inject_failure(proven_block) })
.and_then(|proven_block| self.commit_block(mempool, proven_block))
// Handle errors by propagating the error to the root span and rolling back the block.
.inspect_err(|err| Span::current().set_status(Status::Error { description: format!("{err:?}").into() }))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into it, that's an unfortunate outcome.
Maybe the Debug representation would be better then, even if less nice? At least we'd get the full error chain that way. Not sure there's much we can do, this is a tracing issue afaict.

Replacing thiserror in all our crates with a macro that generates an error report would have side effects unless we rollback our "source instead of Display" approach. If we make the Display impl of all errors an error report then we'd have duplicate error messages.

We also can't make our thiserror replacement only print an error report when core::fmt::Formatter::alternate is set because, based on your findings, we probably can't tell tracing to print the error as "{error:#}" either :(

@Mirko-von-Leipzig Mirko-von-Leipzig force-pushed the mirko/instrument_block_builder branch from 3fe2acb to cac66df Compare February 11, 2025 14:15
Copy link
Collaborator

@igamigo igamigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Screenshots look great!

@Mirko-von-Leipzig Mirko-von-Leipzig merged commit ebef215 into next Feb 11, 2025
11 checks passed
@Mirko-von-Leipzig Mirko-von-Leipzig deleted the mirko/instrument_block_builder branch February 11, 2025 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants