Datadog spans explosion after v1.53.x #6355

Open
davidegreenwald opened this issue Nov 28, 2024 · 6 comments · Fixed by #6112

Comments

@davidegreenwald

Describe the bug

We use the Inigo build of the router and recently upgraded from v0.30.11 to 0.30.15 (bringing us from router 1.53.x to 1.57.1).

The Datadog changes across these versions appear to have exploded our span count in Datadog. The router appears to be ignoring the sampling rates set by its upstream (parent) services, which it previously respected, and is now sending all spans for ingestion. This has increased our ingest volume by roughly 100x and will have a cost impact on our Datadog contract.

We're using the Datadog trace exporter: https://www.apollographql.com/docs/graphos/reference/router/telemetry/trace-exporters/datadog
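For reference, the behaviour we expected is ordinary OpenTelemetry parent-based sampling, where a service follows the sampling decision already made by its caller. A minimal Python sketch of that semantics (purely illustrative, not our actual setup; the 10% ratio is made up):

```python
# Illustrative only -- not our router configuration. Standard OpenTelemetry
# parent-based sampling: child spans follow the parent's sampled decision,
# and the 10% ratio applies to root spans only (the number is made up).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.1)))
)
tracer = trace.get_tracer("sampling-demo")

with tracer.start_as_current_span("parent") as parent:
    with tracer.start_as_current_span("child") as child:
        # The two flags always match: the child never overrides the parent.
        print("parent sampled:", parent.get_span_context().trace_flags.sampled)
        print("child sampled:", child.get_span_context().trace_flags.sampled)
```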

Expected behavior

There should be no change in trace levels from these upgrades.

Additional context

Happy to give you any further information we can here.

@str1aN

str1aN commented Dec 9, 2024

We are experiencing this issue too. We have had to disable Datadog trace export entirely.

@BrynCooke
Contributor

BrynCooke commented Dec 11, 2024

Hi,
Would it be possible to try #6112 to see whether it fixes things for you both?
We're trying to get a strong signal on this before merging, as it has been very tricky to fix.

The background to this is that Datadog has an unusual way of dealing with traces: they need to be sent to the agent even when they are not sampled.

The PR should allow users to have this behaviour while also passing the sampling priority downstream correctly.
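Very roughly, the idea looks like this in OpenTelemetry terms: record and export every span so the agent sees complete traces, but keep the upstream decision attached so it can still be honoured downstream. The Python sketch below is purely illustrative (it is not the router's code, and the "sampling.priority" attribute name is just an example):

```python
# Rough sketch of the idea only -- not the router's implementation.
# Every span is recorded and exported (so the Datadog agent sees complete
# traces), but the parent's sampled/not-sampled decision is preserved as a
# span attribute so it can still be honoured further downstream.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult


class RecordAllKeepPriority(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        parent_span = trace.get_current_span(parent_context)
        parent_sampled = parent_span.get_span_context().trace_flags.sampled
        attrs = dict(attributes or {})
        # Attribute name is illustrative; it records the upstream decision.
        attrs["sampling.priority"] = 1 if parent_sampled else 0
        # Always record and export; the attribute tells the backend what the
        # upstream decision actually was.
        return SamplingResult(Decision.RECORD_AND_SAMPLE, attrs)

    def get_description(self):
        return "RecordAllKeepPriority"


# Wiring it in (again, purely illustrative):
trace.set_tracer_provider(TracerProvider(sampler=RecordAllKeepPriority()))
```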

@davidegreenwald
Author

@BrynCooke Thank you, that's great to know. We won't be able to test this until January due to holidays and PTO but can take a look then and report back.

Would it be helpful to try to put you in contact with our Datadog rep?

@BrynCooke
Contributor

@davidegreenwald Thank you for the offer but I'm not sure it would help.
The issue is that there is a mismatch between the way OTel works and the way Datadog works. OTel doesn't support various parts of the Datadog behaviour, and the parts it does support are not well tested.

Longer term it would be great if Datadog promoted the Otel standard and made this their preferred ingestion mechanism, but I'm not sure it would be in their commercial interests. The alternative has been that we reverse engineer the Datadog protocols and behaviour to try and make everything play nicely.

We're going to try and upstream some of the work that was done in this PR, in particular the way that sampling is handled and PSR is propagated.
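To make the propagation part concrete: in OTel the sampling decision travels as the "sampled" flag inside the W3C traceparent header, whereas Datadog's native propagation carries it in the x-datadog-sampling-priority header. A small illustrative Python sketch of the OTel side (not router code):

```python
# Illustrative sketch: in OTel, the sampling decision is carried across
# services as the "sampled" flag of the W3C traceparent header, which a
# parent-based sampler on the next hop then honours. Datadog's native
# propagation carries the decision in x-datadog-sampling-priority instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(TracerProvider())  # default: parent-based, always-on root
tracer = trace.get_tracer("propagation-demo")

with tracer.start_as_current_span("outgoing-request"):
    headers = {}
    TraceContextTextMapPropagator().inject(headers)
    # e.g. {'traceparent': '00-<trace_id>-<span_id>-01'}; the trailing 01
    # means "sampled".
    print(headers)
```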

@BrynCooke
Contributor

I just wanted to flag here that we have a PR for a new release with the telemetry fixes in:

#6483

It contains another fix over and above #6112.

Internal testing and testing by select users seem to indicate that this issue is fixed.

@davidegreenwald
Author

@BrynCooke Thank you!
