
Add TTFT (time to first token) to Langfuse traces #8385

Open
LastRemote opened this issue Sep 19, 2024 · 9 comments
Labels: P2 Medium priority, add to the next sprint if no P1 available

LastRemote (Contributor) commented Sep 19, 2024

Is your feature request related to a problem? Please describe.
We are developing several chatbot-like applications that require streaming the response from the LLM. There are a couple of metrics to look at, one of which is TTFT (time to first token): how long the user has to wait before seeing anything in the output dialog box. However, due to the way tracing spans are handled in the pipeline, the run invocation inside the component does not have direct access to the span, so we are unable to log this information to the tracer.

Describe the solution you'd like
The simplest solution would be to make the tracing span visible from the component's run() method. This could be a context variable that methods inside the component have access to, but I am not confident about the exact approach here.
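
For illustration, one possible shape for this, assuming Python's contextvars; the variable name current_span and the set_tag call are hypothetical, not existing Haystack API:

    import contextvars

    # set by the tracer before calling component.run(), reset afterwards
    current_span = contextvars.ContextVar("current_span", default=None)

    # inside a component's run() method:
    span = current_span.get()
    if span is not None:
        # record whatever the span API offers, e.g. a first-token timestamp
        span.set_tag("completion_start_time", "2024-09-19T12:00:00Z")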

Describe alternatives you've considered
The only temporary workaround right now is to manipulate the low-level tracing SDKs directly inside the streaming callback function, i.e. write a special callback that uploads the timestamp upon receiving the first SSE chunk.
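
A minimal sketch of this workaround, assuming the Langfuse Python SDK (v2-style API) and a Haystack streaming callback; the names generation and on_chunk are illustrative, not part of Haystack:

    from datetime import datetime, timezone

    from langfuse import Langfuse

    langfuse = Langfuse()  # reads the LANGFUSE_* environment variables
    generation = langfuse.generation(name="chat-completion")

    first_chunk_seen = False

    def on_chunk(chunk):  # passed as streaming_callback to the chat generator
        global first_chunk_seen
        if not first_chunk_seen:
            first_chunk_seen = True
            # Langfuse derives TTFT from completion_start_time on the generation
            generation.update(completion_start_time=datetime.now(timezone.utc))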


julian-risch added the "P2 Medium priority, add to the next sprint if no P1 available" label on Sep 19, 2024
vblagoje (Member) commented Sep 23, 2024

Note to self:

TTFT in Langfuse is calculated automatically when completion_start_time (the timestamp of the first token) is provided on the generation span, i.e. just call update on the generation span with completion_start_time=datetime.now().

This could be done by attaching a custom streaming callback to the chat generator (most likely from our LangfuseTracer), consuming the first token in the callback, and calling the generation span's update, perhaps directly in that callback.

It remains to be investigated how we can do this for async calls as well.
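
A rough sketch of this idea, assuming a Haystack chat generator that exposes a streaming_callback attribute; wrap_streaming_callback is a hypothetical helper, not existing tracer code:

    from datetime import datetime, timezone

    def wrap_streaming_callback(original_callback, generation_span):
        state = {"first": True}

        def wrapped(chunk):
            # timestamp only the very first streamed chunk
            if state["first"]:
                state["first"] = False
                generation_span.update(
                    completion_start_time=datetime.now(timezone.utc)
                )
            if original_callback:
                original_callback(chunk)

        return wrapped

    # e.g. inside LangfuseTracer, before invoking the generator:
    # generator.streaming_callback = wrap_streaming_callback(
    #     generator.streaming_callback, span._span)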

vblagoje (Member) commented Sep 26, 2024

@LastRemote and @julian-risch

This was, in fact, not so hard to do. Forget the recommendation above. We simply need to timestamp the first chunk received from the LLM, and we should do this across all LLM chat generators. When streaming, we don't have data for prompt and completion tokens available, and what's really interesting is that even if we set "prompt_tokens" and "completion_tokens" to 0 (see branch above), Langfuse somehow counts them correctly. I'll check with them how this is actually done. The trace is depicted below.

[Image: Langfuse trace]
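
A sketch of what "timestamp the first chunk" could look like inside a chat generator; the meta layout mirrors the tracer change discussed below and is an assumption, not the merged implementation:

    from datetime import datetime, timezone

    def stream_and_collect(chunk_stream):
        first_chunk_time = None
        chunks = []
        for chunk in chunk_stream:
            if first_chunk_time is None:
                first_chunk_time = datetime.now(timezone.utc)
            chunks.append(chunk)
        meta = {
            "usage": {
                # read by the Langfuse tracer as completion_start_time
                "completion_start_time": (
                    first_chunk_time.isoformat() if first_chunk_time else None
                ),
            }
        }
        return chunks, meta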

LastRemote (Contributor, Author) commented:

> This was, in fact, not so hard to do. Forget the recommendation above. We simply need to timestamp the first chunk received from the LLM. […]

Interesting, I had never thought about this approach, but I guess it should work. So basically we store completion_first_chunk as part of the usage meta so it can be accessed inside haystack.component.output?

By the way, for OpenAI and the latest versions of the Azure OpenAI models, if you set stream_options accordingly, the last streaming chunk will include the actual usage data. Additionally, Langfuse will automatically count the usage tokens if the model is from OpenAI or Claude (see screenshot below).

[Screenshot: Langfuse usage view, 2024-09-26]
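
For reference, this is how the last-chunk usage behavior works with the official openai Python SDK (v1.x); the model name is just an example:

    from openai import OpenAI

    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    usage = None
    for chunk in stream:
        if chunk.usage is not None:  # only the final chunk carries usage
            usage = chunk.usage
    print(usage)  # prompt_tokens=..., completion_tokens=..., total_tokens=...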

vblagoje (Member) commented:

Aha, nice @LastRemote, that's why I collected those meta chunks, hoping that one day this would work. The change in the Langfuse tracer was minimal as well: a one-LOC change in langfuse/tracer.py at ~line 151. We need to:

    span._span.update(
        usage=meta.get("usage") or None,
        model=meta.get("model"),
        completion_start_time=meta.get("usage", {}).get("completion_start_time"),
    )

I'll speak to @julian-risch about scheduling this change in the near future, but if you wish, feel free to create a PR that updates our chat generators with this change and we'll take it from there; I can review the PR, and we can try out the various chat generators together.

LastRemote (Contributor, Author) commented:

@vblagoje Okay sure, I will make a PR.

vblagoje changed the title from "Access to tracing span during component run invocation" to "Add TTFT (time to first token) to Langfuse traces" on Sep 30, 2024
vblagoje (Member) commented:

@LastRemote before we open a bunch of PRs or group all the changes into one PR for all chat generators, let's do one trial PR and set the standard for the other chat generators.

LastRemote (Contributor, Author) commented:

@vblagoje Sorry for the delayed response. I took some days off last week. Here we go: #8444

Note that I only made minimal changes, since I am not exactly sure how to enable include_usage in the OpenAI SDK. I have a customized OpenAI implementation based on httpx that works, but that would require a complete refactor.

LastRemote (Contributor, Author) commented Oct 8, 2024

By the way, I also attempted to support Anthropic (including Bedrock Anthropic) models. It seems there's a mismatch in the usage data format between the Langfuse API and Anthropic when updating the Langfuse span, which causes the operation to fail. I believe it would be better for Langfuse to provide direct support for the raw Anthropic format.
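
A hedged sketch of a possible workaround for this mismatch: map Anthropic's usage keys onto a shape Langfuse accepts. The Anthropic side ({"input_tokens", "output_tokens"}) comes from their API; the target keys ({"input", "output", "total"}) are an assumption based on Langfuse's generic usage schema:

    def anthropic_usage_to_langfuse(usage: dict) -> dict:
        # Anthropic reports e.g. {"input_tokens": 12, "output_tokens": 34}
        input_tokens = usage.get("input_tokens", 0)
        output_tokens = usage.get("output_tokens", 0)
        return {
            "input": input_tokens,
            "output": output_tokens,
            "total": input_tokens + output_tokens,
        }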

vblagoje (Member) commented Oct 8, 2024

Hey @LastRemote, I'll check it out and get back to you. The most recent Anthropic Haystack release should work out of the box. I don't think Bedrock Anthropic works with Langfuse at the moment.
