-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(block-producer): instrument block building #675
Conversation
cd7e6e2
to
9f9faa5
Compare
#[derive(Clone)] | ||
pub struct SharedMempool(Arc<Mutex<Mempool>>); | ||
|
||
impl SharedMempool { | ||
#[instrument(target = COMPONENT, name = "mempool.lock", skip_all)] | ||
pub async fn lock(&self) -> MutexGuard<'_, Mempool> { | ||
self.0.lock().await | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stronger typing so we could always have the mempool.lock
span.
|
||
/// Panics if there is no block in flight. | ||
#[instrument(target = COMPONENT, name = "mempool.block.rollback", skip_all)] | ||
pub fn rollback_block(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer takes block_number
as this made state tracking more painful -- since the error case in the block builder wouldn't have direct access to the block number. We can revert this change, but I don't think it makes a difference at this stage.
1ecdba1
to
b917e80
Compare
b917e80
to
c984aa7
Compare
Something else that's missing is a transaction count for the block. This data isn't directly accessible via @PhilippGackstatter is the transaction list/count something we are going to add to the proven batch? I recall there was some discussion. |
Yeah, there is 0xPolygonMiden/miden-base#1121 which adds |
Its fine, we can wait on it. Its just nice pictures to look at. |
.and_then(|proven_block| async { self.inject_failure(proven_block) }) | ||
.and_then(|proven_block| self.commit_block(mempool, proven_block)) | ||
// Handle errors by propagating the error to the root span and rolling back the block. | ||
.inspect_err(|err| Span::current().set_status(Status::Error { description: format!("{err:?}").into() })) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of sucks, but tracing
doesn't give you an error report. One option is to wrap this error propagation here with anyhow
just for error reporting.
However there is a deeper issue, and that is that #[instrument(err)]
does not do error reporting either.
I'm unsure what the correct general solution is; possibly something to do with a custom trace layer/visitor/subscriber. I tried a few things but I'm giving up for now.
@PhilippGackstatter maybe you stumbled into something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a larger problem in the node and in the client for which I wanted to create issues but didn't have time for yet. Basically, format!("{error}");
is always wrong now, because it only displays the first error in the chain and effectively erases the (hopefully) useful source errors. (I guess in some cases we may want to erase details to not leak info, but that is already handled explicitly like in impl From<AddTransactionError> for tonic::Status
).
So wrapping with anyhow is one solution, or a small custom error report utility function is another. The client has this problem on the Rust -> Wasm boundary, where one has to convert the Rust error types into JsValue
and this is also done with Display
.
In my previous job I was also working on Wasm bindings for a Rust library and we implemented our own small utility error report function, which is just a couple of lines. We don't really need backtraces, which is where anyhow would be useful, so I think we don't necessarily need to depend on anyhow here.
We can add this to miden-base, because, ideally, we also change this in the client and then we have a central place to format these reports. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only issue is that the formatting solution doesn't work for #[instrument(err)]
because it just uses the display impl of the error. We would have to experiment with injecting the formatter into tracing itself - then it would be a good solution.
The more robust approach would be to create a new attribute macro to replace thiserror
which adds the error report as the display impl. If we remove thiserror
then this neatly solves this problem everywhere.
I'm actually somewhat leaning towards the macro option tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have much experience with tracing, but I'm hoping there's a way to inject a formatter, maybe with https://docs.rs/tracing-error/latest/tracing_error/.
I think I saw an issue in thiserror repo asking if format!("{err:#}")
could produce a report, but I don't think that was implemented. Not sure if we should replace thiserror with our own macro though. It would be the exception to the rule in the Rust ecosystem and if we do ever include a "foreign" crate that uses regular thiserror
, then we'd have to deal with that exception too. This may already be the case with some dependencies in miden-base or miden-client. So to me the most universal solution seems to be the custom error reporter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more digging and injecting a custom error reporter is not possible for tracing at the moment.
iiuc tracing_error
injects span information into a new error type - and not the other way around. So its a way of creating a new error that has a form of backtrace via spans.
Relevant issues and PRs
- Syntax for recording errors in events and spans tokio-rs/tracing#3067
- attributes: support recording error as
std::error::Error
tokio-rs/tracing#3057 #[tracing::instrument(err)]
doesn't work with opentelemetry / datadog tokio-rs/tracing#2648
Summary
The issue is two-fold:
tracing
supportsdyn Error
as a specialisation i.e. we could customise the error report there. However this requires opting in - as in, you need manually cast:tracing::event!(err as &dyn Error)
to get that specialisation to kick in.#[instrument(err)]
does not perform this casting and there is no way to opt-in; so by default you always getDisplay
(%
) or opt-intoDebug
(?
).
I think the open-telemetry integration with tracing-opentelemetry
re-uses the string captured by tracing
so we get the same outcome there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into it, that's an unfortunate outcome.
Maybe the Debug
representation would be better then, even if less nice? At least we'd get the full error chain that way. Not sure there's much we can do, this is a tracing issue afaict.
Replacing thiserror in all our crates with a macro that generates an error report would have side effects unless we rollback our "source instead of Display" approach. If we make the Display
impl of all errors an error report then we'd have duplicate error messages.
We also can't make our thiserror replacement only print an error report when core::fmt::Formatter::alternate
is set because, based on your findings, we probably can't tell tracing to print the error as "{error:#}"
either :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client has this problem on the Rust -> Wasm boundary, where one has to convert the Rust error types into JsValue and this is also done with Display.
Yeah, I've also kept this in the back of my mind since making the errors more consistent. In the CLI and library errors are surfaced and reported correctly, but for the web client crate there should be a utility that converts errors to String by reporting sources correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the web client crate there should be a utility that converts errors to String by reporting sources correctly.
Nice! Yeah, what works pretty well is to add something like a WasmResult
trait which can be implemented for all errors that should be converted to JsValue
and then you can just use ?
for the conversion, which is pretty convenient. For inspiration, take a look at this file: https://github.com/iotaledger/identity.rs/blob/31d9f81c0cce18a5917c4677391d7ff46a0f3de2/bindings/wasm/src/error.rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you! I left some comments inline - the main ones are about naming convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
.and_then(|proven_block| async { self.inject_failure(proven_block) }) | ||
.and_then(|proven_block| self.commit_block(mempool, proven_block)) | ||
// Handle errors by propagating the error to the root span and rolling back the block. | ||
.inspect_err(|err| Span::current().set_status(Status::Error { description: format!("{err:?}").into() })) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into it, that's an unfortunate outcome.
Maybe the Debug
representation would be better then, even if less nice? At least we'd get the full error chain that way. Not sure there's much we can do, this is a tracing issue afaict.
Replacing thiserror in all our crates with a macro that generates an error report would have side effects unless we rollback our "source instead of Display" approach. If we make the Display
impl of all errors an error report then we'd have duplicate error messages.
We also can't make our thiserror replacement only print an error report when core::fmt::Formatter::alternate
is set because, based on your findings, we probably can't tell tracing to print the error as "{error:#}"
either :(
Restrain otel crates to utils by re-exporting
3fe2acb
to
cac66df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Screenshots look great!
This PR reworks the instrumentation of the block builder to
I also fix remote span context not being sent (propagator was not being set).
Implementation
I've leant heavily on
#[instrument]
because manually instrumenting sync and async function calls is painful and error prone. The former requiresspan.enter()
andspan.exit()
, and the latter requiresasync_call.instrument(span).await
. This is easy to forget.The downside is that code must first be structured by function/stage because
#[instrument]
cannot be applied to code blocks{ ... }
. Though maybe this is an upside as well. I essentially structured the pipeline based on how I wanted the span information to be tracked.Updated spans
store.client.{method}
though maybe this should bestore.client/{method}
mempool.{method}
Visual aid
Visualized span result using
tracing-forest
(note: forest cannot trace across RPC boundaries so I've manually merged those in and marked them with*
).Note that I haven't really focussed on the store server aspect here, I'm on the fence about the naming convention to use. The standard mentions namespacing via periods aka
x.y.z
, but RPC routes should follow{package}.{service}/{method}
. All this to say that the store client/server have several styles currently.Tidbits:
mempool
i.e.mempool.block.xxx
vsmempool.block_xxx
.