Optimize Exporting to Datadog #42

Open · wants to merge 7 commits into master
Conversation

jeffutter

I have been running into situations on a couple of our services where the single-process bottleneck of calling SpandexDatadog.ApiServer.send_trace causes significant latency under load. Furthermore, a timeout causes the caller to raise, and if you are calling from a telemetry handler (as I imagine most people are), the raise detaches the handler, which is not good.

I believe this is the same issue that #32 is attempting to solve. I'm taking a slightly different approach here to a similar end. The approach is heavily inspired by, and partly borrowed from, OpenTelemetry. I hope the relation to that code gives this PR some confidence, since people are likely already running it in production.

Process

  • There are two buffer ETS tables, A and B (see the buffer sketch after this list).
  • Traces get inserted into ETS table A.
  • Every 2 seconds (configurable), inserts flip to ETS table B.
  • A process gets started to send the traces.
  • ETS table A is given to that process and renamed.
  • ETS table A is recreated (so it can be swapped back to later).
  • The sender process uses :ets.select/3 to read the traces 100 at a time into a Stream (see the exporter sketch after this list).
  • The stream converts the Trace structs to maps and then to Msgpax fragments.
  • The fragments are batched until they hit 10 MB (the Datadog limit).
  • Each batch is Msgpax-encoded and sent to Datadog with hackney.
  • Once all the batches have been sent, the exporter process dies and its ETS table is cleaned up.
  • Wash, rinse, repeat.
  • The process that manages the buffers checks every 500 ms whether they have grown too large (more than 20 MB) and, if so, disables saving traces.
  • Once the buffer is flushed, saving traces is re-enabled.
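
To make the buffer swapping concrete, here is a minimal, self-contained sketch of the double-buffered ETS scheme described above. It is not the code in this PR: the module name BufferSketch, the use of :persistent_term to point writers at the active table, and the hard-coded intervals are illustrative simplifications.

```elixir
defmodule BufferSketch do
  use GenServer

  @tables {:buffer_a, :buffer_b}
  @send_interval 2_000            # flip/flush interval (configurable in the PR)
  @check_interval 500             # how often the buffer size is checked
  @max_bytes 20 * 1024 * 1024     # stop saving traces above ~20 MB

  # Called from the traced process (e.g. a telemetry handler). Writes go
  # straight into the active ETS table, so there is no GenServer.call
  # bottleneck and no timeout that could detach the handler.
  def insert(trace) do
    case :persistent_term.get({__MODULE__, :writer}, {:enabled, :buffer_a}) do
      {:enabled, table} ->
        :ets.insert(table, {System.unique_integer([:monotonic]), trace})

      {:disabled, _table} ->
        :dropped
    end
  end

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    for name <- Tuple.to_list(@tables), do: new_table(name)
    :persistent_term.put({__MODULE__, :writer}, {:enabled, elem(@tables, 0)})
    Process.send_after(self(), :flush, @send_interval)
    Process.send_after(self(), :check_size, @check_interval)
    {:ok, %{active: 0}}
  end

  @impl true
  def handle_info(:flush, %{active: active} = state) do
    full = elem(@tables, active)
    next = 1 - active

    # Flip writers over to the other table first ...
    :persistent_term.put({__MODULE__, :writer}, {:enabled, elem(@tables, next)})

    # ... then rename the full table out of the way, recreate the empty slot,
    # and hand the renamed table to a short-lived exporter process.
    # (Assumes the previous export has finished; the real code must handle overlap.)
    exporting = :ets.rename(full, :"#{full}_exporting")
    new_table(full)
    {:ok, pid} = Task.start(fn -> ExporterSketch.export(exporting) end)
    :ets.give_away(exporting, pid, :export)

    Process.send_after(self(), :flush, @send_interval)
    {:noreply, %{state | active: next}}
  end

  def handle_info(:check_size, %{active: active} = state) do
    table = elem(@tables, active)
    bytes = :ets.info(table, :memory) * :erlang.system_info(:wordsize)
    mode = if bytes > @max_bytes, do: :disabled, else: :enabled
    :persistent_term.put({__MODULE__, :writer}, {mode, table})
    Process.send_after(self(), :check_size, @check_interval)
    {:noreply, state}
  end

  defp new_table(name) do
    :ets.new(name, [:named_table, :public, :duplicate_bag, write_concurrency: true])
  end
end
```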
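
And a matching sketch of the exporter side: paging through the handed-off table with :ets.select/3, converting traces to Msgpax fragments, chunking to roughly 10 MB, and PUTting each batch to the agent with hackney. Again, the module and helper names (ExporterSketch, trace_to_map/1), the agent URL, and the payload shape are placeholders, not the PR's actual code.

```elixir
defmodule ExporterSketch do
  @batch_limit 10 * 1024 * 1024    # Datadog's ~10 MB payload limit
  @select_count 100                # rows pulled per :ets.select/3 call

  def export(table) do
    # Wait until the buffer process hands us ownership of the renamed table.
    receive do
      {:"ETS-TRANSFER", _tab, _from, :export} -> :ok
    end

    table
    |> stream_rows()
    |> Stream.map(fn {_key, trace} ->
      packed = trace |> trace_to_map() |> Msgpax.pack!(iodata: true)
      {Msgpax.Fragment.new(packed), IO.iodata_length(packed)}
    end)
    |> chunk_by_bytes(@batch_limit)
    |> Enum.each(&send_batch/1)

    :ets.delete(table)
  end

  # Lazily page through the table, @select_count rows at a time.
  defp stream_rows(table) do
    Stream.resource(
      fn -> :ets.select(table, [{:_, [], [:"$_"]}], @select_count) end,
      fn
        :"$end_of_table" -> {:halt, :done}
        {rows, continuation} -> {rows, :ets.select(continuation)}
      end,
      fn _acc -> :ok end
    )
  end

  # Group {fragment, byte_size} pairs into batches of at most `limit` bytes.
  defp chunk_by_bytes(fragments, limit) do
    Stream.chunk_while(
      fragments,
      {[], 0},
      fn {frag, size}, {acc, total} ->
        if total + size > limit and acc != [] do
          {:cont, Enum.reverse(acc), {[frag], size}}
        else
          {:cont, {[frag | acc], total + size}}
        end
      end,
      fn
        {[], _total} -> {:cont, {[], 0}}
        {acc, _total} -> {:cont, Enum.reverse(acc), {[], 0}}
      end
    )
  end

  # Placeholder: stands in for the PR's real conversion of a %Spandex.Trace{}
  # into the map shape the Datadog agent expects.
  defp trace_to_map(trace), do: Enum.map(trace.spans, &Map.from_struct/1)

  defp send_batch(fragments) do
    # A list of fragments packs into one MessagePack array without re-encoding
    # the already-packed traces. URL and headers are illustrative.
    body = Msgpax.pack!(fragments, iodata: true)

    :hackney.put(
      "http://localhost:8126/v0.3/traces",
      [{"Content-Type", "application/msgpack"}],
      body,
      []
    )
  end
end
```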

Tradeoffs

  • Writing traces is fast (concurrent writes into an ETS table).
  • Sending traces happens outside the critical path, so latency or errors won't impact actual traffic.
  • The buffer size is checked on an interval; with enough traffic we could far exceed the max size before the manager process notices and turns off tracing.
  • When the buffer size is exceeded, we drop traces.

I have had this code running in production for a couple of days and it seems solid. In the past 24 hours, we logged 138M traces and dropped 500k (0.4%). I posit this is better than the alternative, where we time out requests at 30 that would otherwise have succeeded, and end up dropping some traces anyway because they timed out and were never sent.

Further Work

There are a couple of places this could be improved in the future:

  • The DD agent can rate-limit and return a 429. It would be good to dynamically adjust the send_interval to send more aggressively until we get rate-limited and then back off (a rough sketch follows this list). This would require changing the return value of the HTTP module to indicate rate-limiting, which would be a breaking change.
  • It might be smart to do the formatting/encoding in the calling process and store the Msgpax fragments in the ETS table, rather than the Trace structs.
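
For the rate-limiting idea, the adjustment could look something like the sketch below: back off multiplicatively when the agent returns a 429 and creep back toward the configured floor on success. The :rate_limited return value, the module name, and the factors are all hypothetical, since surfacing 429s would require the breaking HTTP-module change mentioned above.

```elixir
defmodule SendIntervalSketch do
  @min_interval 500       # ms; the most aggressive flush cadence allowed
  @max_interval 10_000    # ms; the furthest we back off
  @factor 2

  # Success: flush more aggressively next time, down to the floor.
  def next_interval(:ok, current), do: max(@min_interval, div(current, @factor))

  # Hypothetical :rate_limited result (agent responded 429): back off.
  def next_interval(:rate_limited, current), do: min(@max_interval, current * @factor)
end
```

The buffer manager would then schedule its next flush with `Process.send_after(self(), :flush, next_interval)` instead of a fixed send_interval.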

Please let me know how this looks. I'm glad to provide more info or make further improvements.
