Optimize Exporting to Datadog #42

Open · wants to merge 7 commits into master
Conversation

jeffutter

I have been running into situations on a couple of our services where the single-process bottleneck of calling SpandexDatadog.ApiServer.send_trace causes significant latency under load. Furthermore, a timeout causes the caller to raise, and if you are calling from a telemetry handler (as I imagine most people are), the raise detaches the handler, which is not good.

I believe this is the same issue that #32 is attempting to solve. I'm taking a slightly different approach here to a similar end. The approach is heavily inspired by, and partly borrowed from, OpenTelemetry. I hope the relation to that code gives this PR some confidence, since people are likely already running it in production.

Process

  • There are two buffer ETS tables, A and B (see the buffer sketch after this list).
  • Traces get inserted into ETS table A.
  • Every 2 seconds (configurable), inserts flip to ETS table B.
  • A process gets started to send the traces.
  • ETS table A is given to that process and renamed.
  • ETS table A is recreated (so it can be swapped back to later).
  • The sender process uses :ets.select/3 to read the traces 100 at a time into a Stream (see the exporter sketch after this list).
  • The stream converts the Trace structs to maps and then to Msgpax fragments.
  • The fragments are batched until they hit 10 MB (the Datadog limit).
  • Each batch is Msgpax-encoded and sent to Datadog with hackney.
  • Once all the batches have been sent, the exporter process dies and its ETS table is cleaned up.
  • Wash, rinse, repeat.
  • The process that manages the buffers checks every 500 ms whether they have grown too large (more than 20 MB) and, if so, disables saving traces.
  • Once the buffer is flushed, saving traces is re-enabled.
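
To make the buffer swapping concrete, here is a minimal, self-contained sketch of the double-buffered ETS scheme described above. It is not the code in this PR: the module name BufferSketch, the use of :persistent_term to point writers at the active table, and the hard-coded intervals are illustrative simplifications.

```elixir
defmodule BufferSketch do
  use GenServer

  @tables {:buffer_a, :buffer_b}
  @send_interval 2_000            # flip/flush interval (configurable in the PR)
  @check_interval 500             # how often the buffer size is checked
  @max_bytes 20 * 1024 * 1024     # stop saving traces above ~20 MB

  # Called from the traced process (e.g. a telemetry handler). Writes go
  # straight into the active ETS table, so there is no GenServer.call
  # bottleneck and no timeout that could detach the handler.
  def insert(trace) do
    case :persistent_term.get({__MODULE__, :writer}, {:enabled, :buffer_a}) do
      {:enabled, table} ->
        :ets.insert(table, {System.unique_integer([:monotonic]), trace})

      {:disabled, _table} ->
        :dropped
    end
  end

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    for name <- Tuple.to_list(@tables), do: new_table(name)
    :persistent_term.put({__MODULE__, :writer}, {:enabled, elem(@tables, 0)})
    Process.send_after(self(), :flush, @send_interval)
    Process.send_after(self(), :check_size, @check_interval)
    {:ok, %{active: 0}}
  end

  @impl true
  def handle_info(:flush, %{active: active} = state) do
    full = elem(@tables, active)
    next = 1 - active

    # Flip writers over to the other table first ...
    :persistent_term.put({__MODULE__, :writer}, {:enabled, elem(@tables, next)})

    # ... then rename the full table out of the way, recreate the empty slot,
    # and hand the renamed table to a short-lived exporter process.
    # (Assumes the previous export has finished; the real code must handle overlap.)
    exporting = :ets.rename(full, :"#{full}_exporting")
    new_table(full)
    {:ok, pid} = Task.start(fn -> ExporterSketch.export(exporting) end)
    :ets.give_away(exporting, pid, :export)

    Process.send_after(self(), :flush, @send_interval)
    {:noreply, %{state | active: next}}
  end

  def handle_info(:check_size, %{active: active} = state) do
    table = elem(@tables, active)
    bytes = :ets.info(table, :memory) * :erlang.system_info(:wordsize)
    mode = if bytes > @max_bytes, do: :disabled, else: :enabled
    :persistent_term.put({__MODULE__, :writer}, {mode, table})
    Process.send_after(self(), :check_size, @check_interval)
    {:noreply, state}
  end

  defp new_table(name) do
    :ets.new(name, [:named_table, :public, :duplicate_bag, write_concurrency: true])
  end
end
```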
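
And a matching sketch of the exporter side: paging through the handed-off table with :ets.select/3, converting traces to Msgpax fragments, chunking to roughly 10 MB, and PUTting each batch to the agent with hackney. Again, the module and helper names (ExporterSketch, trace_to_map/1), the agent URL, and the payload shape are placeholders, not the PR's actual code.

```elixir
defmodule ExporterSketch do
  @batch_limit 10 * 1024 * 1024    # Datadog's ~10 MB payload limit
  @select_count 100                # rows pulled per :ets.select/3 call

  def export(table) do
    # Wait until the buffer process hands us ownership of the renamed table.
    receive do
      {:"ETS-TRANSFER", _tab, _from, :export} -> :ok
    end

    table
    |> stream_rows()
    |> Stream.map(fn {_key, trace} ->
      packed = trace |> trace_to_map() |> Msgpax.pack!(iodata: true)
      {Msgpax.Fragment.new(packed), IO.iodata_length(packed)}
    end)
    |> chunk_by_bytes(@batch_limit)
    |> Enum.each(&send_batch/1)

    :ets.delete(table)
  end

  # Lazily page through the table, @select_count rows at a time.
  defp stream_rows(table) do
    Stream.resource(
      fn -> :ets.select(table, [{:_, [], [:"$_"]}], @select_count) end,
      fn
        :"$end_of_table" -> {:halt, :done}
        {rows, continuation} -> {rows, :ets.select(continuation)}
      end,
      fn _acc -> :ok end
    )
  end

  # Group {fragment, byte_size} pairs into batches of at most `limit` bytes.
  defp chunk_by_bytes(fragments, limit) do
    Stream.chunk_while(
      fragments,
      {[], 0},
      fn {frag, size}, {acc, total} ->
        if total + size > limit and acc != [] do
          {:cont, Enum.reverse(acc), {[frag], size}}
        else
          {:cont, {[frag | acc], total + size}}
        end
      end,
      fn
        {[], _total} -> {:cont, {[], 0}}
        {acc, _total} -> {:cont, Enum.reverse(acc), {[], 0}}
      end
    )
  end

  # Placeholder: stands in for the PR's real conversion of a %Spandex.Trace{}
  # into the map shape the Datadog agent expects.
  defp trace_to_map(trace), do: Enum.map(trace.spans, &Map.from_struct/1)

  defp send_batch(fragments) do
    # A list of fragments packs into one MessagePack array without re-encoding
    # the already-packed traces. URL and headers are illustrative.
    body = Msgpax.pack!(fragments, iodata: true)

    :hackney.put(
      "http://localhost:8126/v0.3/traces",
      [{"Content-Type", "application/msgpack"}],
      body,
      []
    )
  end
end
```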

Tradeoffs

  • Writing traces is fast (concurrent writes into an ETS table).
  • Sending traces happens outside the critical path, so latency or errors won't impact actual traffic.
  • The buffer size is checked on an interval; with enough traffic we could far exceed the max size before the manager process notices and turns off tracing.
  • When the buffer size is exceeded, we drop traces.

I have had this code running in production for a couple of days and it seems solid. In the past 24 hours, we logged 138M traces and dropped 500k (0.4%). I posit this is better than the alternative, where we time out requests at 30 that would otherwise have succeeded, and end up dropping some traces anyway because they timed out and were never sent.

Further Work

There are a couple of places this could be improved in the future:

  • The DD agent can rate-limit and return a 429. It would be good to dynamically adjust the send_interval to send more aggressively until we get rate-limited and then back off (a rough sketch follows this list). This would require changing the return value of the HTTP module to indicate rate-limiting, which would be a breaking change.
  • It might be smart to do the formatting/encoding in the calling process and store the Msgpax fragments in the ETS table, rather than the Trace structs.
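
For the rate-limiting idea, the adjustment could look something like the sketch below: back off multiplicatively when the agent returns a 429 and creep back toward the configured floor on success. The :rate_limited return value, the module name, and the factors are all hypothetical, since surfacing 429s would require the breaking HTTP-module change mentioned above.

```elixir
defmodule SendIntervalSketch do
  @min_interval 500       # ms; the most aggressive flush cadence allowed
  @max_interval 10_000    # ms; the furthest we back off
  @factor 2

  # Success: flush more aggressively next time, down to the floor.
  def next_interval(:ok, current), do: max(@min_interval, div(current, @factor))

  # Hypothetical :rate_limited result (agent responded 429): back off.
  def next_interval(:rate_limited, current), do: min(@max_interval, current * @factor)
end
```

The buffer manager would then schedule its next flush with `Process.send_after(self(), :flush, next_interval)` instead of a fixed send_interval.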

Please let me know how this looks. I'm glad to provide more info or make further improvements.
