Datadog spans explosion after v1.53.x #6355

Open
davidegreenwald opened this issue Nov 28, 2024 · 6 comments · Fixed by #6112

Comments

@davidegreenwald

Describe the bug

We use the Inigo build of the router and recently upgraded from v0.30.11 to 0.30.15 (bringing us from router 1.53.x to 1.57.1).

The Datadog changes across these versions appear to have exploded our span count in Datadog. The router appears to be ignoring the sampling rates set by its upstream (parent) services, which it previously respected, and is now sending all spans for ingestion. This has increased our ingest volume by roughly 100x and will have a cost impact on our Datadog contract.

We're using the Datadog trace exporter: https://www.apollographql.com/docs/graphos/reference/router/telemetry/trace-exporters/datadog
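For reference, the behaviour we expected is ordinary OpenTelemetry parent-based sampling, where a service follows the sampling decision already made by its caller. A minimal Python sketch of that semantics (purely illustrative, not our actual setup; the 10% ratio is made up):

```python
# Illustrative only -- not our router configuration. Standard OpenTelemetry
# parent-based sampling: child spans follow the parent's sampled decision,
# and the 10% ratio applies to root spans only (the number is made up).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.1)))
)
tracer = trace.get_tracer("sampling-demo")

with tracer.start_as_current_span("parent") as parent:
    with tracer.start_as_current_span("child") as child:
        # The two flags always match: the child never overrides the parent.
        print("parent sampled:", parent.get_span_context().trace_flags.sampled)
        print("child sampled:", child.get_span_context().trace_flags.sampled)
```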

Expected behavior

There should be no change in trace levels from these upgrades.

Additional context

Happy to give you any further information we can here.

@str1aN

str1aN commented Dec 9, 2024

We are experiencing this issue too. We have had to disable Datadog trace export entirely.

@BrynCooke
Contributor

BrynCooke commented Dec 11, 2024

Hi,
Would it be possible to try #6112 to see whether it fixes things for you both?
We're trying to get a strong signal on this before merging, as it has been very tricky to fix.

The background to this is that Datadog has an unusual way of dealing with traces: they need to be sent to the agent even when they are not sampled.

The PR should allow users to have this behaviour while also passing the sampling priority downstream correctly.
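Very roughly, the idea looks like this in OpenTelemetry terms: record and export every span so the agent sees complete traces, but keep the upstream decision attached so it can still be honoured downstream. The Python sketch below is purely illustrative (it is not the router's code, and the "sampling.priority" attribute name is just an example):

```python
# Rough sketch of the idea only -- not the router's implementation.
# Every span is recorded and exported (so the Datadog agent sees complete
# traces), but the parent's sampled/not-sampled decision is preserved as a
# span attribute so it can still be honoured further downstream.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult


class RecordAllKeepPriority(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        parent_span = trace.get_current_span(parent_context)
        parent_sampled = parent_span.get_span_context().trace_flags.sampled
        attrs = dict(attributes or {})
        # Attribute name is illustrative; it records the upstream decision.
        attrs["sampling.priority"] = 1 if parent_sampled else 0
        # Always record and export; the attribute tells the backend what the
        # upstream decision actually was.
        return SamplingResult(Decision.RECORD_AND_SAMPLE, attrs)

    def get_description(self):
        return "RecordAllKeepPriority"


# Wiring it in (again, purely illustrative):
trace.set_tracer_provider(TracerProvider(sampler=RecordAllKeepPriority()))
```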

@davidegreenwald
Author

@BrynCooke Thank you, that's great to know. We won't be able to test this until January due to holidays and PTO but can take a look then and report back.

Would it be helpful to try to put you in contact with our Datadog rep?

@BrynCooke
Contributor

@davidegreenwald Thank you for the offer but I'm not sure it would help.
The issue is that there is a mismatch between the way OTel works and the way Datadog works. OTel doesn't support various parts of the Datadog behaviour, and the parts it does support are not well tested.

Longer term it would be great if Datadog promoted the Otel standard and made this their preferred ingestion mechanism, but I'm not sure it would be in their commercial interests. The alternative has been that we reverse engineer the Datadog protocols and behaviour to try and make everything play nicely.

We're going to try and upstream some of the work that was done in this PR, in particular the way that sampling is handled and PSR is propagated.
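To make the propagation part concrete: in OTel the sampling decision travels as the "sampled" flag inside the W3C traceparent header, whereas Datadog's native propagation carries it in the x-datadog-sampling-priority header. A small illustrative Python sketch of the OTel side (not router code):

```python
# Illustrative sketch: in OTel, the sampling decision is carried across
# services as the "sampled" flag of the W3C traceparent header, which a
# parent-based sampler on the next hop then honours. Datadog's native
# propagation carries the decision in x-datadog-sampling-priority instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(TracerProvider())  # default: parent-based, always-on root
tracer = trace.get_tracer("propagation-demo")

with tracer.start_as_current_span("outgoing-request"):
    headers = {}
    TraceContextTextMapPropagator().inject(headers)
    # e.g. {'traceparent': '00-<trace_id>-<span_id>-01'}; the trailing 01
    # means "sampled".
    print(headers)
```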

@BrynCooke
Contributor

I just wanted to flag here that we have a PR for a new release with the telemetry fixes in:

#6483

It contains another fix over and above #6112.

Internal testing and testing by select users seem to indicate that this issue is fixed.

@davidegreenwald
Author

@BrynCooke Thank you!
