Telemetry duration of async resolvers blown out by other resolvers #758
😂 It's literally in @benwilson512 and @bruce's book: … it's just not mentioned in the
I've tried:

```elixir
defmodule TestSchema do
  use Absinthe.Schema

  object :field_resolution_timing do
    field :begin_ms, :integer
    field :delay_ms, :integer
    field :end_ms, :integer
    # labels are names, not numbers
    field :label, :string
  end

  query do
    # …

    field :delay_batch, :field_resolution_timing do
      arg :delay_ms, :integer

      resolve fn _, %{delay_ms: delay_ms}, _ ->
        ref = make_ref()

        batch({__MODULE__, :batch_delay_and_report}, {ref, delay_ms}, fn results ->
          {:ok, results[ref]}
        end)
      end
    end
  end

  def batch_delay_and_report(_, args) do
    IO.inspect(args, label: "batch_delay_and_report args")
    Map.new(args, fn {ref, delay_ms} -> {ref, delay_and_report(:delay_batch, delay_ms)} end)
  end

  # …
end
```

I didn't read enough emphasis on “when the thing you're executing isn't sensibly batched”, which is the usual case when I'm having to hit REST APIs to get the job done. Back to the bug, though, I think it's fair to say that for
If so, we need to document it somewhere so people trying to rely on the measurement know what it's measuring.
I'm not quite sure what alternative you're trying to suggest here... Telemetry instrumentation tracks the time from start to finish of resolving a field. We shouldn't subtract time from that based on what kind of operation happened between start and finish. If you need more detail about what's going on between start and finish, then that code should be traced as well. Those other resolvers will also show up as spans that should overlap in time. I think of tracing as measuring code execution, so it's going to match what happens in Absinthe code, not the shape of the GraphQL query or our mental representation of what "counts" as part of the duration.
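To make the "start to finish" semantics concrete, here's a minimal handler for Absinthe's field-level telemetry. The event name and the `:duration`/`:resolution` shapes are from the Absinthe 1.5+ telemetry guide, so check them against your version; the duration reported here for a batched field includes any time spent waiting on other resolvers, which is exactly the behaviour under discussion:

```elixir
# Sketch: log the wall-clock duration Absinthe reports for each field.
:telemetry.attach(
  "log-field-duration",
  [:absinthe, :resolve, :field, :stop],
  fn _event, %{duration: duration}, %{resolution: res}, _config ->
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("#{res.definition.name}: #{ms}ms")
  end,
  nil
)
```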
I'm not sure we have any alternative other than to document it: the duration measures how long it took Absinthe to get to the result, not how long it took your code to get the result, and if you want to know the latter you'll need to measure it yourself. Given a longer horizon, more telemetry on something similar to #757 could plausibly provide the latter for the user, but I reckon it'll be so full of corners that any such telemetry should be left to the authors of whatever's doing the heavy lifting. Speaking of #757: if you could straighten me out on that last question about the execution order, it'll totally let me cancel the PR and get out of your hair.
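For the "measure it yourself" option, the cheapest sketch is to time the user-supplied work inside the resolver itself, so the measurement is independent of when Absinthe's resolution loop happened to schedule it. `:timer.tc/1` and `:telemetry.execute/3` are standard; `MyApp.SlowAPI.fetch/1` and the event name are hypothetical placeholders:

```elixir
# Sketch: record only your own code's runtime, not Absinthe's scheduling delay.
resolve fn _parent, args, _resolution ->
  # MyApp.SlowAPI is a stand-in for whatever REST call you're making
  {micros, result} = :timer.tc(fn -> MyApp.SlowAPI.fetch(args) end)
  :telemetry.execute([:my_app, :resolver, :fetch], %{duration_us: micros}, %{args: args})
  result
end
```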
Perhaps down the line the best model would be to invert the approach. Start a span when Absinthe's resolution loop starts doing work, and then end it right before it calls user code. Then if you overlaid that against the other spans you'd be able to tell that other work was happening.
Anyone up for documenting this?
#770 is about as much as I can slap together this week, but it's a start.
OK to close this now?
Yeah, between this and the docs anyone confused can find out what's going on.
Split from #757:
Agreed.
I'm still concerned people could get misled by the telemetry and any tracing built on it if the scheduling of the inner part can get delayed by other fields, and that delay is also counted against it. Does batch behave differently, somehow?
If it's not fixable, and if batch doesn't suffer the same problems, should we actively encourage people to use batch?