Telemetry duration of async resolvers blown out by other resolvers #758
😂 It's literally in @benwilson512 and @bruce's book: … it's just not mentioned in the
I've tried:

```elixir
defmodule TestSchema do
  use Absinthe.Schema

  object :field_resolution_timing do
    field :begin_ms, :integer
    field :delay_ms, :integer
    field :end_ms, :integer
    # labels are names, not numbers
    field :label, :string
  end

  query do
    # …

    field :delay_batch, :field_resolution_timing do
      arg :delay_ms, :integer

      resolve fn _, %{delay_ms: delay_ms}, _ ->
        ref = make_ref()

        batch({__MODULE__, :batch_delay_and_report}, {ref, delay_ms}, fn results ->
          {:ok, results[ref]}
        end)
      end
    end
  end

  def batch_delay_and_report(_, args) do
    IO.inspect(args, label: "batch_delay_and_report args")
    Map.new(args, fn {ref, delay_ms} -> {ref, delay_and_report(:delay_batch, delay_ms)} end)
  end

  # …
end
```

I didn't read enough emphasis on “when the thing you're executing isn't sensibly batched”, which is the usual case when I'm having to hit REST APIs to get the job done. Back to the bug, though, I think it's fair to say that for
If so, we need to document it somewhere so people trying to rely on the measurement know what it's measuring.
I'm not quite sure what alternative you're trying to suggest here... Telemetry instrumentation tracks the time from start to finish of resolving a field. We shouldn't subtract time from that based on what kind of operation happened between start and finish. If you need more detail about what's going on between start and finish, then that code should be traced as well. Those other resolvers will also show up as spans that should overlap in time. I think of tracing as measuring code execution, so it's going to match what happens in Absinthe code, not the shape of the GraphQL query or our mental representation of what "counts" as part of the duration.
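To make the "start to finish" semantics concrete, here's a minimal handler for Absinthe's field-level telemetry. The event name and the `:duration`/`:resolution` shapes are from the Absinthe 1.5+ telemetry guide, so check them against your version; the duration reported here for a batched field includes any time spent waiting on other resolvers, which is exactly the behaviour under discussion:

```elixir
# Sketch: log the wall-clock duration Absinthe reports for each field.
:telemetry.attach(
  "log-field-duration",
  [:absinthe, :resolve, :field, :stop],
  fn _event, %{duration: duration}, %{resolution: res}, _config ->
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("#{res.definition.name}: #{ms}ms")
  end,
  nil
)
```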
I'm not sure we have any alternative other than to document it: the duration measures how long it took Absinthe to get to the result, not how long it took your code to get the result, and if you want to know the latter you'll need to measure it yourself. Given a longer horizon, more telemetry on something similar to #757 could plausibly provide the latter for the user, but I reckon it'll be so full of corners that any such telemetry should be left to the authors of whatever's doing the heavy lifting. Speaking of #757: if you could straighten me out on that last question about the execution order, it'll totally let me cancel the PR and get out of your hair.
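For the "measure it yourself" option, the cheapest sketch is to time the user-supplied work inside the resolver itself, so the measurement is independent of when Absinthe's resolution loop happened to schedule it. `:timer.tc/1` and `:telemetry.execute/3` are standard; `MyApp.SlowAPI.fetch/1` and the event name are hypothetical placeholders:

```elixir
# Sketch: record only your own code's runtime, not Absinthe's scheduling delay.
resolve fn _parent, args, _resolution ->
  # MyApp.SlowAPI is a stand-in for whatever REST call you're making
  {micros, result} = :timer.tc(fn -> MyApp.SlowAPI.fetch(args) end)
  :telemetry.execute([:my_app, :resolver, :fetch], %{duration_us: micros}, %{args: args})
  result
end
```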
Perhaps down the line the best model would be to invert the approach. Start a span when Absinthe's resolution loop starts doing work, and then end it right before it calls user code. Then if you overlaid that against the other spans you'd be able to tell that other work was happening.
Anyone up for documenting this?
#770 is about as much as I can slap together this week, but it's a start.
OK to close this now?
Yeah, between this and the docs anyone confused can find out what's going on.
Split from #757:
Agreed.
I'm still concerned people could get misled by the telemetry and any tracing built on it if the scheduling of the inner part can get delayed by other fields, and that delay is also counted against it. Does batch behave differently, somehow?
If it's not fixable, and if batch doesn't suffer the same problems, should we actively encourage people to use batch?