Handle BatchWriteSpans retries when a retryable error occurs #523

Open

aabmass opened this issue Feb 6, 2023 · 8 comments
Labels
enhancement accepted, enhancement, priority: p2, trace

Comments

aabmass commented Feb 6, 2023

Follow-up to #181

Trace

Since we are using gRPC, I believe we will get automatic retries for well-known retryable statuses. However, the trace exporter still needs to handle retries for the idempotent BatchWriteSpans() call. Unfortunately, we are not generating a client library into the googleapis/google-cloud-node repo that we could use instead of raw gRPC. I think this is because we have a "handwritten" client library, which is actually the Cloud Trace agent (https://github.com/googleapis/cloud-trace-nodejs). If we had a client library, we would get automatic retries pulled from the service config.

Possible fixes:

  1. Add one-off code to do retries at the call site
  2. Generate a client library and migrate to use it
  3. Looks like @grpc/grpc-js actually supports retry config encoded in the grpc.service_config channel option. I have not tried this; some details are here. A rough sketch follows below.
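
A rough, untested sketch of option 3, assuming the retry policy is expressed as a standard gRPC service config and passed through @grpc/grpc-js channel options. The fully qualified service name and the place where the exporter constructs its client are assumptions on my part, not taken from the exporter's source:

```ts
// A hypothetical gRPC service config enabling retries for BatchWriteSpans.
// The service name below is an assumption; double-check it against the
// Cloud Trace v2 proto before relying on it.
const serviceConfig = {
  methodConfig: [
    {
      name: [
        {
          service: 'google.devtools.cloudtrace.v2.TraceService',
          method: 'BatchWriteSpans',
        },
      ],
      retryPolicy: {
        maxAttempts: 5,
        initialBackoff: '0.1s',
        maxBackoff: '30s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE'],
      },
    },
  ],
};

// Channel options for @grpc/grpc-js: the service config is passed as a JSON
// string, and retries are opted into explicitly.
const channelOptions = {
  'grpc.service_config': JSON.stringify(serviceConfig),
  'grpc.enable_retries': 1,
};

// channelOptions would then be passed wherever the exporter constructs its
// gRPC client (typically the third constructor argument); I have not checked
// how that wiring looks in this repo.
```

If this pans out, it would give us retries without any hand-rolled backoff logic at the call site, but someone would need to verify it against the exporter's current client setup.
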
@aabmass added the enhancement, trace, and enhancement accepted labels Feb 6, 2023

aabmass commented Feb 6, 2023

Triaged

weyert commented Apr 6, 2023

@aabmass @dashpole Would this be related to errors I am seeing in Error Reporting, like:

batchWriteSpans error: 14 UNAVAILABLE: The service is currently unavailable.
batchWriteSpans error: 14 UNAVAILABLE: 502:Bad Gateway

This happened on 5 April 2023, but there was nothing on the Google Cloud status page for Cloud Trace. Timing was around 21:07, 21:38, and 16:08-16:14.

aabmass commented Apr 6, 2023

Retries could possibly have helped, but it's hard to say. We would eventually drop data even with retries if the backend wasn't responding for long enough.

ryoya-i-1215 commented Jul 3, 2023

@aabmass @dashpole

We would eventually drop data even with retries if the backend wasn't responding for long enough.

Is this related to this issue:
open-telemetry/opentelemetry-js#3740 (comment)

Please answer the following three questions:

  1. When will the retry process be added?
  2. If the timeout error is related to this problem, how can it be resolved?
  3. If I use Cloud Run for the backend, would extending the task timeout solve the problem?

@aabmass
Copy link
Contributor Author

aabmass commented Jul 5, 2023

  2. If the timeout error is related to this problem, how can it be resolved?

It looks like that comment is unrelated to the issue it was posted on.

  3. If I use Cloud Run for the backend, would extending the task timeout solve the problem?

I don't think that would help. If you're seeing this with Cloud Run, it's probably related to CPU throttling, which we have seen a few times. You can try the "CPU is always allocated" option described on that page (see the discussion in #62).

If you want to leave CPU throttling as is, the retries could help get your data sent, but you're likely to still see some error logs. open-telemetry/opentelemetry-js#3740 (comment) explains how to silence the logs.
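
For anyone who just wants to quiet these logs in the meantime, here is a minimal sketch of the generic knob on the OpenTelemetry JS side, assuming the error is emitted through the @opentelemetry/api diag channel (the linked comment is the authoritative answer):

```ts
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

// Silence all of OpenTelemetry's internal diagnostic logging. This is a
// blunt instrument: it hides every internal log, not just the
// batchWriteSpans errors from the Cloud Trace exporter.
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.NONE);
```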

aabmass commented Jul 5, 2023

  1. When will the retry process be added?

I think our best bet for implementing retries is to "generate a client library and migrate to use it," as described in this issue. However, that depends on another Google team first generating the client library and then this exporter pulling it in. Unfortunately, I don't have a ton of time to work on this.

@aabmass mentioned this issue Jul 5, 2023

gblock0 commented Jul 6, 2023

Hey @aabmass, I'm the commenter from the OpenTelemetry issue linked above. As I mention in my comment, we are seeing a bunch of these batchWriteSpans exceptions and timeouts throughout the day, like @weyert mentions above.

We aren't using Cloud Functions, but we are using Kubernetes Engine to run our deployments. I don't mind dropping the logs if they are getting retried, but is there a way for me to verify they are getting retried?

aabmass commented Jul 7, 2023

They are not getting retried right now 🙁 This issue is for implementing retries.
