WIP: Add async task wrapper for trace propagation. #757

Closed

Conversation

@garthk (Contributor) commented Jul 9, 2019

I tried a few times to use middleware and extra phases to propagate trace information into the processes running my async resolvers, but couldn't. That leaves me with either wrapping each resolver myself, or intervening before Absinthe runs Task.async/1. If there's a cleaner way, I'm all ears. This method feels clumsy, and I don't yet know if it'll break with data loaders (opencensus-beam/opencensus_absinthe#21) or other parts of the ecosystem.

@darrenclark (Contributor) commented Jul 9, 2019

It feels a little dicey, but I've been playing around with using an ETS table to store the pid of the process doing the resolution and the span context, then using $callers to find the span context. Something like:

defmodule AbsintheUtils.OpenCensus.OcTaskContext do
  @moduledoc """
  Stores a span_ctx in an ETS table, to be looked up via $callers.

  Useful for Dataloader.
  """

  use Task
  alias __MODULE__.EtsTable

  # Create the named table once, e.g. at application start.
  def create_ets_table do
    :ets.new(EtsTable, [
      :named_table,
      :set,
      :public,
      write_concurrency: true
    ])
  end

  # Record the span context against the current (resolution) process.
  def set(span_ctx) do
    :ets.insert(EtsTable, {self(), span_ctx})
  end

  # Remove this process's entry; skipping this leaks rows (see below).
  def unset do
    :ets.delete(EtsTable, self())
  end

  # Walk $callers (set by Task and friends) until some caller has a
  # stored span_ctx.
  def get do
    Process.get(:"$callers", [])
    |> Enum.find_value(fn pid ->
      case :ets.lookup(EtsTable, pid) do
        [] -> nil
        [{_, ctx}] -> ctx
      end
    end)
  end
end

I have Absinthe phases that run after Opencensus.Absinthe.Phase.Push and before Opencensus.Absinthe.Phase.Pop, calling set and unset respectively.

When I need to find the span_ctx, get walks $callers until it finds one.

The ETS table grows forever if unset doesn't get called (due to an exception or crash), though.
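
For concreteness, a minimal sketch of what the set phase could look like (the module name and the use of :ocp.current_span_ctx/0 are my assumptions; adapt to however you obtain the span context):

defmodule MyApp.Phase.SetSpanContext do
  # Hypothetical phase: runs right after Opencensus.Absinthe.Phase.Push and
  # records the current span context so child processes can find it via $callers.
  use Absinthe.Phase

  alias AbsintheUtils.OpenCensus.OcTaskContext

  @impl Absinthe.Phase
  def run(blueprint, _opts) do
    OcTaskContext.set(:ocp.current_span_ctx())
    {:ok, blueprint}
  end
end

The phase can go into the pipeline with Absinthe.Pipeline.insert_after/3, with a mirror-image phase before Pop calling unset.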

@garthk (Contributor, Author) commented Jul 9, 2019

I'd forgotten about $callers! Will totally steal that later. Tip in return: given a PID, you can reach into its process dictionary to retrieve any span information maintained by :ocp:

Process.info(pid) |> get_in([:dictionary, :oc_span_ctx_key])

You'll get nil, :undefined, or an :opencensus.span_ctx. Maybe that spares you the ETS table?

@garthk (Contributor, Author) commented Jul 9, 2019

Zooming back out: while I could plausibly have each resolver function rummage through its $callers trying to find one with an oc_span_ctx_key, I'd strongly prefer to do that once in a plugin, not once per field.
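
For reference, that rummage would look something like this (a hypothetical helper; :oc_span_ctx_key is the pdict key :ocp uses, per above):

# Hypothetical helper: walk $callers, peeking at each caller's process
# dictionary for the span context :ocp keeps there.
def span_ctx_from_callers do
  Process.get(:"$callers", [])
  |> Enum.find_value(fn pid ->
    case Process.info(pid, :dictionary) do
      {:dictionary, dict} ->
        case List.keyfind(dict, :oc_span_ctx_key, 0) do
          {_, :undefined} -> nil
          {_, ctx} -> ctx
          nil -> nil
        end

      # Process already dead.
      nil ->
        nil
    end
  end)
end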

@darrenclark (Contributor)

@garthk I went with the ETS table because I was under the impression that it was expensive to get at another process's dictionary. Specifically, I had seen: https://github.com/erlang/otp/blob/master/erts/doc/src/communication.xml#L61-L74

I was trying to add some tracing around Ecto, and $callers was 3 or 4 items long, so I was hoping not to have to wait on 3 or 4 processes.

I have not benchmarked it, though. Maybe it's not much of a performance hit.

I'm also not sure whether accessing the process dictionary directly has performance implications if one of the $callers is on another Erlang node.
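
For what it's worth, a throwaway way to eyeball that cost locally (not a real benchmark; also note Process.info/2 only works for pids on the local node):

# Spin up a process with something in its dictionary, then time lookups.
pid =
  spawn(fn ->
    Process.put(:oc_span_ctx_key, :fake_ctx)
    Process.sleep(:infinity)
  end)

{micros, _results} =
  :timer.tc(fn ->
    for _ <- 1..10_000, do: Process.info(pid, :dictionary)
  end)

IO.puts("#{micros / 10_000} microseconds per Process.info/2 call")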

@binaryseed (Contributor)

This is exactly what the Telemetry integration is for; it's in master and 1.5 already.

If there are more fields that are needed (the field start event is a little bare right now), then we can talk about getting those passed to the Telemetry callbacks.
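
For anyone following along, attaching to those events looks roughly like this (the :stop event name is my assumption; check the 1.5 source for the exact event names and payloads):

# Attach one handler to Absinthe's field-resolution telemetry events.
:telemetry.attach_many(
  "opencensus-absinthe-fields",
  [
    [:absinthe, :resolve, :field, :start],
    [:absinthe, :resolve, :field, :stop]
  ],
  fn event, measurements, metadata, _config ->
    # e.g. open a span on :start and close it on :stop,
    # using metadata to identify the field
    IO.inspect({event, measurements, metadata})
  end,
  nil
)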

@garthk (Contributor, Author) commented Jul 13, 2019

@binaryseed, I've been working with a clone of your telemetry implementation against 1.4 and haven't yet managed to solve the whole problem. I trust you'll be able to spot whatever I'm missing, set me straight, and there'll prove to be no need for this awful PR?

We've got two broad challenges with tracing in 1.4:

  1. Gathering information about the start and end time, and result, of each field with a resolver and the blueprint as a whole

  2. Managing trace spans in such a way that when we create a span for an outbound request as part of resolving a field, we know that, and can use the field resolution span as a parent for a new span.

Your :telemetry hookup and my clone of it work for:

  • creating spans with appropriate start and end times for the blueprint
  • creating spans with appropriate start and end times for fields with resolvers
  • propagating the blueprint span for synchronous field resolution (process dictionary or seq_trace)
  • propagating the blueprint span for asynchronous field resolution (seq_trace only)

If you can straighten me out on one part of the resolution mechanism (see below), it'll also prove sufficient for:

  • propagating the field span for synchronous field resolution (process dictionary or seq_trace)
  • propagating the field span for asynchronous field resolution (seq_trace only)

That'd still leave:

  • propagating the blueprint span for asynchronous field resolution (process dictionary)
  • propagating the field span for asynchronous field resolution (process dictionary)

For propagation using the process dictionary, we need to either update every resolution function or replace Absinthe's use of Task.async with something that wraps the resolution function in a closure, so we can update the process dictionary in the child; hence this PR.
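
Something like this, roughly (a minimal sketch, not this PR's exact code; async_with_span is a hypothetical name):

# Capture the span context in the parent, re-establish it in the child.
def async_with_span(fun) do
  span_ctx = :ocp.current_span_ctx()

  Task.async(fn ->
    :ocp.with_span_ctx(span_ctx)
    fun.()
  end)
end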

For propagation using seq_trace, it's sufficient to have the label set at the time Task.async is called, assuming it's the right label (again, see below). We're concerned that'll interfere with our users' existing use of seq_trace, if any: unlike the process dictionary, there's no namespace.
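
(For reference, the seq_trace variant is tiny. Wrapping the label in a tagged tuple is one way to soften the namespace problem, though it still clobbers any existing label; OTP 21+ allows arbitrary terms as labels.)

# Parent: stash the span context in the seq_trace label before Task.async runs.
:seq_trace.set_token(:label, {:oc_span_ctx, :ocp.current_span_ctx()})

# Child: recover it, if present.
case :seq_trace.get_token(:label) do
  {:label, {:oc_span_ctx, ctx}} -> :ocp.with_span_ctx(ctx)
  _ -> :ok
end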

With either, I still don't understand the Absinthe resolution mechanism enough to be confident we can set either the pdict or trace label in the blueprint process to reflect the field we're working on at the time the resolver function starts.

You know it inside out. Is that possible, given only the existing phase and middleware extension points, and the 1.5 telemetry extension points?

@garthk (Contributor, Author) commented Jul 13, 2019

Also, if I'm reading the results for the failing test in 65dfa8a right, the durations for asynchronous fields are for how long it took Absinthe's process to get the result, not how long it took to produce the result. I think the duration reported for an async field can blow out by up to the duration of the longest sync field present in the same query.

@benwilson512 (Contributor)

@garthk that's a fair point, but other than fields that use the async helper it's probably the right call. Consider something like batch or dataloader:

batch({SomeModule, :my_batch_fun}, id, fn results ->
  {:ok, Map.get(results, id)}
end)

The field isn't resolved when the batch function completes; it's resolved when the field finishes resolving, which requires {:ok, Map.get(results, id)} to run. That inner part can be arbitrarily complex and take arbitrary time, which should be counted against it.

async doesn't really have an after-the-fact sync part, but at the same time it's also one of the least commonly used helpers, since batch and dataloader are also effectively async while allowing for improved efficiency.

> We've got two broad challenges with tracing in 1.4:

I don't think targeting 1.4 makes sense. If there are things we can improve with respect to tracing in 1.5 great, let's work on that, but 1.5 is so close to being releasable that I don't want to distract the core team with significant 1.4 reworks.

@bruce (Contributor) commented Jul 13, 2019

I agree with @benwilson512 here; there will be no compelling reason to stick with v1.4 once v1.5 is out (especially in light of some significant performance optimizations, and the fact that v1.5 is largely backwards-compatible), and there aren't any plans to backport features, just changes relating to security.

@garthk (Contributor, Author) commented Jul 13, 2019

> I don't think targeting 1.4 makes sense

Oh, agreed. I'm not targeting 1.4. I have 1.4 in prod, and I'm describing the problems I'm having with it, and I'm being specific because I don't want to mislead anyone.

I'm fine concentrating on 1.5. If we can solve the trace propagation in 1.5 with the current telemetry design, I'll be thrilled. Hence, all the Q&A. Well, mostly Q.

If I use an [:absinthe, :resolve, :field, :start] handler to capture field details, what's the risk that the handler will get called again for some other field before the resolver for that field is called, whether sync or async or batch or dataloader?

@binaryseed (Contributor)

OK to close this now?

@garthk (Contributor, Author) commented Sep 15, 2019

Yep. If time permits, I'll swing back around later with a more generic way to hook all of Absinthe's execution of user-supplied code, so all of us instrumentation authors can share the load of figuring out how to hook in. Better yet, we can upstream it to Absinthe once it's stable.

Unlikely to be this month or next, though, unless things get hairy and I need to upgrade the introspection to lock it down.

@garthk garthk closed this Sep 15, 2019
@garthk garthk deleted the feature/async-task-wrapper branch September 15, 2019 22:26
@garthk garthk restored the feature/async-task-wrapper branch September 15, 2019 22:26