WIP: Add async task wrapper for trace propagation. #757

Closed

Conversation

@garthk (Contributor) commented Jul 9, 2019

I tried a few times to use middleware and extra phases to propagate trace information into the processes running my async resolvers, but couldn't. That leaves me with either wrapping each resolver myself, or intervening before Absinthe runs Task.async/1. If there's a cleaner way, I'm all ears. This method feels clumsy, and I don't yet know if it'll break with data loaders (opencensus-beam/opencensus_absinthe#21) or other parts of the ecosystem.

@darrenclark (Contributor) commented Jul 9, 2019

It feels a little dicey, but I've been playing around with using an ETS table to store the pid of the process doing the resolution and the span context, then using $callers to find the span context. Something like:

defmodule AbsintheUtils.OpenCensus.OcTaskContext do
  @moduledoc """
  Stores a span_ctx in an ETS table, to be looked up via $callers.

  Useful for Dataloader.
  """

  use Task
  alias __MODULE__.EtsTable

  # Create the named table once, e.g. at application start.
  def create_ets_table do
    :ets.new(EtsTable, [
      :named_table,
      :set,
      :public,
      write_concurrency: true
    ])
  end

  # Record the span context against the current (resolution) process.
  def set(span_ctx) do
    :ets.insert(EtsTable, {self(), span_ctx})
  end

  # Remove this process's entry; skipping this leaks rows (see below).
  def unset do
    :ets.delete(EtsTable, self())
  end

  # Walk $callers (set by Task and friends) until some caller has a
  # stored span_ctx.
  def get do
    Process.get(:"$callers", [])
    |> Enum.find_value(fn pid ->
      case :ets.lookup(EtsTable, pid) do
        [] -> nil
        [{_, ctx}] -> ctx
      end
    end)
  end
end

I have Absinthe phases that run after Opencensus.Absinthe.Phase.Push and before Opencensus.Absinthe.Phase.Pop, calling set and unset respectively.

When I need to find the span_ctx, get walks $callers until it finds one.

The ETS table grows forever if unset doesn't get called (due to an exception or crash), though.
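
For concreteness, a minimal sketch of what the set phase could look like (the module name and the use of :ocp.current_span_ctx/0 are my assumptions; adapt to however you obtain the span context):

defmodule MyApp.Phase.SetSpanContext do
  # Hypothetical phase: runs right after Opencensus.Absinthe.Phase.Push and
  # records the current span context so child processes can find it via $callers.
  use Absinthe.Phase

  alias AbsintheUtils.OpenCensus.OcTaskContext

  @impl Absinthe.Phase
  def run(blueprint, _opts) do
    OcTaskContext.set(:ocp.current_span_ctx())
    {:ok, blueprint}
  end
end

The phase can go into the pipeline with Absinthe.Pipeline.insert_after/3, with a mirror-image phase before Pop calling unset.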

@garthk (Contributor, Author) commented Jul 9, 2019

I'd forgotten about $callers! Will totally steal that later. Tip in return: given a PID, you can reach into its process dictionary to retrieve any span information maintained by :ocp:

Process.info(pid) |> get_in([:dictionary, :oc_span_ctx_key])

You'll get nil, :undefined, or an :opencensus.span_ctx. Maybe that spares you the ETS table?

@garthk (Contributor, Author) commented Jul 9, 2019

Zooming back out: while I could plausibly have each resolver function rummage through its $callers trying to find one with an oc_span_ctx_key, I'd strongly prefer to do that once in a plugin, not once per field.
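
For reference, that rummage would look something like this (a hypothetical helper; :oc_span_ctx_key is the pdict key :ocp uses, per above):

# Hypothetical helper: walk $callers, peeking at each caller's process
# dictionary for the span context :ocp keeps there.
def span_ctx_from_callers do
  Process.get(:"$callers", [])
  |> Enum.find_value(fn pid ->
    case Process.info(pid, :dictionary) do
      {:dictionary, dict} ->
        case List.keyfind(dict, :oc_span_ctx_key, 0) do
          {_, :undefined} -> nil
          {_, ctx} -> ctx
          nil -> nil
        end

      # Process already dead.
      nil ->
        nil
    end
  end)
end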

@darrenclark (Contributor)

@garthk I went with the ETS table because I was under the impression that it was expensive to get at another process's dictionary. Specifically, I had seen: https://github.com/erlang/otp/blob/master/erts/doc/src/communication.xml#L61-L74

I was trying to add some tracing around Ecto, and $callers was 3 or 4 items long, so I was hoping not to have to wait on 3 or 4 processes.

I have not benchmarked it, though. Maybe it's not much of a performance hit.

I'm also not sure whether accessing the process dictionary directly has performance implications if one of the $callers is on another Erlang node.
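
For what it's worth, a throwaway way to eyeball that cost locally (not a real benchmark; also note Process.info/2 only works for pids on the local node):

# Spin up a process with something in its dictionary, then time lookups.
pid =
  spawn(fn ->
    Process.put(:oc_span_ctx_key, :fake_ctx)
    Process.sleep(:infinity)
  end)

{micros, _results} =
  :timer.tc(fn ->
    for _ <- 1..10_000, do: Process.info(pid, :dictionary)
  end)

IO.puts("#{micros / 10_000} microseconds per Process.info/2 call")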

@binaryseed (Contributor)

This is exactly what the Telemetry integration is for; it's in master and 1.5 already.

If there are more fields that are needed (the field start event is a little bare right now), then we can talk about getting those passed to the Telemetry callbacks.
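
For anyone following along, attaching to those events looks roughly like this (the :stop event name is my assumption; check the 1.5 source for the exact event names and payloads):

# Attach one handler to Absinthe's field-resolution telemetry events.
:telemetry.attach_many(
  "opencensus-absinthe-fields",
  [
    [:absinthe, :resolve, :field, :start],
    [:absinthe, :resolve, :field, :stop]
  ],
  fn event, measurements, metadata, _config ->
    # e.g. open a span on :start and close it on :stop,
    # using metadata to identify the field
    IO.inspect({event, measurements, metadata})
  end,
  nil
)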

@garthk (Contributor, Author) commented Jul 13, 2019

@binaryseed, I've been working with a clone of your telemetry implementation against 1.4 and haven't yet managed to solve the whole problem. I trust you'll be able to spot whatever I'm missing, set me straight, and there'll prove to be no need for this awful PR?

We've got two broad challenges with tracing in 1.4:

  1. Gathering information about the start and end time, and result, of each field with a resolver and the blueprint as a whole

  2. Managing trace spans in such a way that when we create a span for an outbound request as part of resolving a field, we know that, and can use the field resolution span as a parent for a new span.

Your :telemetry hookup and my clone of it work for:

  • creating spans with appropriate start and end times for the blueprint
  • creating spans with appropriate start and end times for fields with resolvers
  • propagating the blueprint span for synchronous field resolution (process dictionary or seq_trace)
  • propagating the blueprint span for asynchronous field resolution (seq_trace only)

If you can straighten me out on one part of the resolution mechanism (see below), it'll also prove sufficient for:

  • propagating the field span for synchronous field resolution (process dictionary or seq_trace)
  • propagating the field span for asynchronous field resolution (seq_trace only)

That'd still leave:

  • propagating the blueprint span for asynchronous field resolution (process dictionary)
  • propagating the field span for asynchronous field resolution (process dictionary)

For propagation using the process dictionary, we need to either update every resolution function or replace Absinthe's use of Task.async with something that wraps the resolution function in a closure, so we can update the process dictionary in the child; hence this PR.
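
Something like this, roughly (a minimal sketch, not this PR's exact code; async_with_span is a hypothetical name):

# Capture the span context in the parent, re-establish it in the child.
def async_with_span(fun) do
  span_ctx = :ocp.current_span_ctx()

  Task.async(fn ->
    :ocp.with_span_ctx(span_ctx)
    fun.()
  end)
end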

For propagation using seq_trace, it's sufficient to have the label set at the time Task.async is called, assuming it's the right label (again, see below). We're concerned that'll interfere with our users' existing use of seq_trace, if any: unlike the process dictionary, there's no namespace.
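
(For reference, the seq_trace variant is tiny. Wrapping the label in a tagged tuple is one way to soften the namespace problem, though it still clobbers any existing label; OTP 21+ allows arbitrary terms as labels.)

# Parent: stash the span context in the seq_trace label before Task.async runs.
:seq_trace.set_token(:label, {:oc_span_ctx, :ocp.current_span_ctx()})

# Child: recover it, if present.
case :seq_trace.get_token(:label) do
  {:label, {:oc_span_ctx, ctx}} -> :ocp.with_span_ctx(ctx)
  _ -> :ok
end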

With either, I still don't understand the Absinthe resolution mechanism enough to be confident we can set either the pdict or trace label in the blueprint process to reflect the field we're working on at the time the resolver function starts.

You know it inside out. Is that possible, given only the existing phase and middleware extension points, and the 1.5 telemetry extension points?

@garthk (Contributor, Author) commented Jul 13, 2019

Also, if I'm reading the results for the failing test in 65dfa8a right, the durations for asynchronous fields are for how long it took Absinthe's process to get the result, not how long it took to produce the result. I think the duration reported for an async field can blow out by up to the duration of the longest sync field present in the same query.

@benwilson512 (Contributor)

@garthk that's a fair point, but other than fields that use the async helper it's probably the right call. Consider something like batch or dataloader:

batch({SomeModule, :my_batch_fun}, id, fn results ->
  {:ok, Map.get(results, id)}
end)

The field isn't resolved when the batch function completes; it's resolved when the field finishes resolving, which requires {:ok, Map.get(results, id)} to run. That inner part can be arbitrarily complex and take arbitrary time, which should be counted against it.

async doesn't really have an after-the-fact sync part, but at the same time it's also one of the least commonly used helpers, since batch and dataloader are also effectively async while allowing for improved efficiency.

> We've got two broad challenges with tracing in 1.4:

I don't think targeting 1.4 makes sense. If there are things we can improve with respect to tracing in 1.5 great, let's work on that, but 1.5 is so close to being releasable that I don't want to distract the core team with significant 1.4 reworks.

@bruce (Contributor) commented Jul 13, 2019

I agree with @benwilson512 here; there will be no compelling reason to stick with v1.4 once v1.5 is out (especially in light of some significant performance optimizations, and the fact that v1.5 is largely backwards-compatible), and there aren't any plans to backport features, just changes relating to security.

@garthk (Contributor, Author) commented Jul 13, 2019

> I don't think targeting 1.4 makes sense

Oh, agreed. I'm not targeting 1.4. I have 1.4 in prod, and I'm describing the problems I'm having with it, and I'm being specific because I don't want to mislead anyone.

I'm fine concentrating on 1.5. If we can solve the trace propagation in 1.5 with the current telemetry design, I'll be thrilled. Hence, all the Q&A. Well, mostly Q.

If I use an [:absinthe, :resolve, :field, :start] handler to capture field details, what's the risk that the handler will get called again for some other field before the resolver for that field is called, whether sync or async or batch or dataloader?

@binaryseed (Contributor)

OK to close this now?

@garthk (Contributor, Author) commented Sep 15, 2019

Yep. If time permits, I'll swing back around later with a more generic way to hook all of Absinthe's execution of user-supplied code, so all of us instrumentation authors can share the load of figuring out how to hook in. Better yet, we can upstream it to Absinthe once it's stable.

Unlikely to be this month or next, though, unless things get hairy and I need to upgrade the introspection to lock it down.

@garthk garthk closed this Sep 15, 2019
@garthk garthk deleted the feature/async-task-wrapper branch September 15, 2019 22:26
@garthk garthk restored the feature/async-task-wrapper branch September 15, 2019 22:26