I have been running into situations on a couple of our services where the single-process bottleneck on `SpandexDatadog.ApiServer.send_trace` causes significant latency under load. Worse, the call timeout causes the caller to raise, and if you are calling from a telemetry handler (I imagine most people are), the raise detaches the handler, which is not good.

I believe this is the same issue that #32 is attempting to solve; I'm taking a slightly different approach here to a similar end. The approach is heavily inspired by/borrowed from opentelemetry. I hope the relation to that code gives this PR a little confidence, since people are likely already running it in production.
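To make the failure mode concrete, here is a contrived example (the event name, handler id, and the `metadata.trace` field are made up for illustration, not taken from our services or this PR):

```elixir
:telemetry.attach(
  "send-trace-on-stop",
  [:my_app, :request, :stop],
  fn _event, _measurements, metadata, _config ->
    # Every caller funnels through one GenServer. If that call times out,
    # the error propagates out of this handler and :telemetry permanently
    # detaches it, so later requests silently stop being reported.
    SpandexDatadog.ApiServer.send_trace(metadata.trace)
  end,
  nil
)
```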
Process

- There are two `ets` tables, A and B.
- New traces are written to `ets` table A.
- When it is time to send, writes switch to `ets` table B.
- `ets` table A is given to the sending process and renamed.
- `ets` table A is recreated (so it can be swapped back to later).
- `:ets.select/3` is used to read the traces 100 at a time into a Stream.
- The traces are converted to `Msgpax.Fragment`s.
- The batch is `Msgpax`-encoded and sent to DD with hackney.
- The drained `ets` table is cleaned up.
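To make the mechanics concrete, here is a rough sketch of the table-swap and batched read. It is not the actual module in this PR: the table names, match spec, agent URL, and the use of plain `Msgpax.pack!/1` (instead of fragments) are placeholders, the A/B alternation is collapsed into a rename-and-recreate, and the small window between renaming and recreating the write table is ignored.

```elixir
defmodule BufferSketch do
  @batch_size 100
  @agent_url "http://localhost:8126/v0.3/traces"

  # Callers insert directly into ets instead of calling a single GenServer.
  def write(trace) do
    :ets.insert(:spandex_buffer_a, {make_ref(), trace})
  end

  # Run by the sender process on each send_interval.
  def flush do
    # Take over the current write table under a new name and recreate the
    # write table so new traces keep landing while we drain the old data.
    draining = :ets.rename(:spandex_buffer_a, :spandex_buffer_draining)
    :ets.new(:spandex_buffer_a, [:named_table, :public, write_concurrency: true])

    # Read the buffered traces 100 at a time into a Stream and send them
    # to the Datadog agent in batches.
    draining
    |> select_stream()
    |> Stream.chunk_every(@batch_size)
    |> Enum.each(&send_batch/1)

    :ets.delete(draining)
  end

  defp select_stream(table) do
    Stream.resource(
      fn -> :ets.select(table, [{{:_, :"$1"}, [], [:"$1"]}], @batch_size) end,
      fn
        :"$end_of_table" -> {:halt, :done}
        {traces, continuation} -> {traces, :ets.select(continuation)}
      end,
      fn _acc -> :ok end
    )
  end

  defp send_batch(traces) do
    # Real payloads need the Datadog span format; plain pack!/1 stands in
    # for the Msgpax.Fragment handling described above.
    body = Msgpax.pack!(traces)

    :hackney.request(
      :post,
      @agent_url,
      [{"Content-Type", "application/msgpack"}],
      body,
      []
    )
  end
end
```

The key point is that the write path is a plain `:ets.insert/2`, so callers never block on (or time out against) a single GenServer.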
Tradeoffs

- Traces can be dropped (they are buffered in memory in the `ets` table until they are sent).

I have had this code running in production for a couple of days and it seems solid. In the past 24 hours, we logged 138M traces and dropped 500k (0.4%). I posit this is better than the alternative, where we time out requests at 30 that would otherwise have succeeded, and end up dropping some traces anyway because they timed out and were never sent.
Further Work
There are a couple of places this could be improved further in the future:

- Tune `send_interval` to be more aggressive until we get rate-limited, and then back off. This would require adjusting the return of the `http` module to indicate rate-limiting and would be a breaking change.
- Store the `Msgpax` fragments in the `ets` table, rather than the Trace structs.

Please let me know how this looks. I'm glad to provide more info or make further improvements.
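For the first item above, a back-of-the-envelope sketch of what an adaptive interval could look like (the 429 check, the doubling/halving, and the bounds are all assumptions, not part of this PR):

```elixir
# Hypothetical helper: shrink the interval while the agent accepts traces,
# back off when it rate-limits us.
defmodule AdaptiveInterval do
  @min_interval 100      # ms
  @max_interval 10_000   # ms

  def next(current, http_status) do
    case http_status do
      429 -> min(current * 2, @max_interval)       # rate-limited: back off
      _ok -> max(div(current, 2), @min_interval)   # accepted: be more aggressive
    end
  end
end

# AdaptiveInterval.next(1_000, 200) #=> 500
# AdaptiveInterval.next(1_000, 429) #=> 2_000
```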