Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for injection of otel spans into tasks for enhancement #31

Open
cchanley2003 opened this issue Sep 25, 2023 · 15 comments · May be fixed by #55
Open

Support for injection of otel spans into tasks for enhancement #31

cchanley2003 opened this issue Sep 25, 2023 · 15 comments · May be fixed by #55

Comments

@cchanley2003
Copy link

Dapr workflows looks like it uses durable tasks. In order to support this request: dapr/dapr#6906 it looks like we would need to support the ability to inject spans into tasks to allow client to add onto spans and add attributes to existing spans.

First wanted to confirm that what I being requested isn't currently supported and also confirm that you would accept a PR for this enhancement?

Initial though was to put span into: https://github.com/microsoft/durabletask-python/blob/4046191d28a6f0a7779d829e5aec88a06493f3d8/durabletask/task.py#L339 to allow clients to enhance spans...

Maybe this is not the right repo to start this process. If so can somebody point me in the right direction?

@cgillum
Copy link
Member

cgillum commented Oct 10, 2023

Hi @cchanley2003. While I think this request may require work in other repos as well (i.e., the SDK repos), this is a good repo for getting things going, especially since your request relates to Dapr.

I responded just now to dapr/dapr#6906. Please take a look at that when you get a chance.

I'm interpreting your request to be surfacing OTel span information into the activity task so that you're able to correlate your own custom logs or attributes with the end-to-end distributed trace. Currently, each SDK should have access to the parent trace context, but I don't think we're currently making that context available to the user code. If I understand the ask correctly, I think this is definitely something we can do.

@cchanley2003
Copy link
Author

Yes. In an ideal world we surface the span at the WorkflowActivityContext in dapr, which implies adding it to the ActivityContext in durable tasks.

So if I have a activity function definition:

def model_properties(ctx: WorkflowActivityContext, input):

From the ctx I could access the otel span. It looks like some work has been started in this direction with the proto buf definitions to add trace info, so the work seems like it would just be getting that added to the streams... but that was my initial glance.

@cchanley2003
Copy link
Author

So I have been looking more into this. One thing I have noticed is that the go version points to a distributed tracing branch of the proto-bufs file. The python sdk points to the main branch. It looks like those version are incompatible. For instance:

message ExecutionStartedEvent {
...
 TraceContext parentTraceContext = 7;
 ..

in the branch.
vs

google.protobuf.StringValue correlationData = 7;

in the main which is what the python sdk points at. Maybe this is a better discussion for the python sdk project? I want to leverage the trace context as a jumping of point.

Also is this the best place to have this discussion? Let me know.

@cgillum
Copy link
Member

cgillum commented Oct 31, 2023

Oh, interesting. I'll need to take a closer look at this to figure out what happened. I believe that the top version is the "correct" one, but I'm not sure how for things have gotten out of sync.

@cgillum
Copy link
Member

cgillum commented Nov 1, 2023

@cchanley2003 thanks for pointing out this misalignment with the protobufs. I went ahead and corrected this as part of #43. Unfortunately, it's unlikely that the Dapr runtime will be updated with this new build until 1.13 (unless somehow we managed to bundle it in as part of a hotfix).

@cchanley2003
Copy link
Author

cchanley2003 commented Nov 9, 2023

Thanks for updating that. Looks like things are in sync now. Without a reference implementation I am a little confused about the intent of the parentTrace in the ExecutionStartedEvent. Can you comment on the design intent there? Here was my initial thought of an implementation but I want to be sure I am swimming upstream. I am going to focus on the Activity use case. I imagine something similar for orchestration.

  1. Backend provides a parentTraceContext as part of ActivityRequest
  2. client sdk executor receives ActivityRequest and starts a new span off of parentTrace. It injects span into ActivityContext
  3. Client code uses span to do what it will (sub spans, attributes)
  4. client sdk activity executor closes span

This implies that the client needs to have the ability to wire in a backend for otel, which is doesn't exist atm.

As I mentioned before in Discord I am willing to submit PRs to make this happen. I am also willing to do this across all the sdks, but I want to be sure we are in accord on design. None of what I have mentioned uses the ExecutionStarted events so I feel I might be off base on approach.

Thanks for your time and appreciate any feedback.

@cgillum
Copy link
Member

cgillum commented Nov 29, 2023

  1. Backend provides a parentTraceContext as part of ActivityRequest

I haven't yet done the investigation to know if this step is required, but it might be. Alternatively, it's possible that the parent trace context is already flowing as part of the gRPC request since (I think) the context we use to send the gRPC request already has some trace context information associated with it. This part needs testing/experimentation.

The rest of the steps look correct to me.

Adding @kaibocai who has also expressed interest in working on this problem but let us know @cchanley2003 if you'd prefer to contribute it. :)

@kaibocai
Copy link
Member

kaibocai commented Dec 2, 2023

Hi @cchanley2003 ,

I think you are right on ActivityRequest it needs to have parentTraceContext in order to provide the ability for activities to add their spans. Currently, the code has parentTraceContext ExecutionStartedEvent, I think that's because we want to provide the same ability for workflows. But definitely, parentTraceContext needs to be added to ActivityRequest as well.

@kaibocai
Copy link
Member

kaibocai commented Dec 6, 2023

Regards to context propagation through Grpc, the recommend way from OTEL is to have trace context encoded into the metadata of Grpc message instead of the message itself.

OTEL already has SDK that helps to simplify context propagation through Grpc, still researching in this direction.
Java instrumentation library: https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/grpc-1.6/library#library-instrumentation-for-grpc-160
Go should have similar support, will continue researching.

@cgillum

@cchanley2003
Copy link
Author

Thanks for the replies @kaibocai. So yeah I saw the ExecutionStartedEvent. My focus is on the the python SDK (though I am willing to do fixes in other sdks) and it didn't look like that Event was used at all so I was uncertain on where it fit in with propagating the traces (or even the intent there). So it would seem that the first step would be the add the traceContext to the ActivityRequest in the .protos? I can (or somebody) start with a PR for that?

@kaibocai
Copy link
Member

kaibocai commented Dec 6, 2023

Hi @cchanley2003 , I am new to OTEL, so maybe I am wrong about the following statement. Based on my research, I found the suggested way to propagate context through grpc is not to encode the trace context through the Grpc message, instead it's recommended to propagate through the metadata of grpc message, and it seems OTEL already has libraries helping to do this work.

For instance, the Python library can be found here https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-grpc, have you ever considered this direction, so that no updates on the protobuf is needed. I believe it will simplify the implementation.

Also, it would be great if you could help with Python SDK for this support I can help as well! Thank you!

@cchanley2003
Copy link
Author

Yes, that makes sense. Just saw your earlier post. So no need to update the proto.

@kaibocai
Copy link
Member

kaibocai commented Jan 4, 2024

I did more research and tested a bit on my local, it seems the instrumentation libraries cannot meet our requirements here. The instrumentation libraries will simplify tracing the Grpc call, but it doesn't provide an easy way to pass through trace context (no easy access to the trace context inert to the grpc call metadata). So I think the original direction (encode trace context into Grpc message) is more suitable here.

@shivamkm07
Copy link
Contributor

I did more research and tested a bit on my local, it seems the instrumentation libraries cannot meet our requirements here. The instrumentation libraries will simplify tracing the Grpc call, but it doesn't provide an easy way to pass through trace context (no easy access to the trace context inert to the grpc call metadata). So I think the original direction (encode trace context into Grpc message) is more suitable here.

Agree @kaibocai. I looked into it as well. We can easily propagate context for a normal grpc request(non-streaming) using otel interceptors(e.g. here) but for streaming requests, context is propagated at the start only i.e. when the stream is created. Once the stream creation is done, then since no context is passed during stream.send() method, and because we can't update stream context, the newer context(updated after stream creation) is never passed to client/server.

Here in DTF-Go as well at the time of stream creation, there is nothing in context(since no orchestration/acitivity are created yet). We have the updated context(with all span info) when activity/orchestration workitem needs to be sent to the app but this context is not passed while sending the workItem. So best way to do it is to encode trace context in the message itself.

@pralonga
Copy link

Any update on the issue and the related PRs ? Having the parent span in the client activities would be nice.

I am using dapr-workflow with the dotnet SDK, but most of the work done in the PRs should resolve the issue there too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants