
feat(openai): add token usage stream options to request #11606

Merged: 16 commits into main, Jan 9, 2025

Conversation

Yun-Kim
Contributor

@Yun-Kim Yun-Kim commented Dec 2, 2024

This PR adds special casing so that, unless explicitly configured otherwise, users' streamed OpenAI chat/completion requests will include token usage as part of the streamed response by default.

Motivation

OpenAI streamed responses have historically not provided token usage details as part of the streamed response. However, earlier this year OpenAI added a stream_options: {"include_usage": True} kwarg to the chat/completions API that provides token usage details in an additional stream chunk at the end of the streamed response.

If the user does not specify this kwarg, OpenAI does not provide token usage, and our current behavior is a best-effort fallback: 1) use the tiktoken library to calculate token counts, or 2) use a very crude heuristic to estimate them. Neither is ideal, as neither alternative takes function/tool calling into account. It is simpler and more accurate to request the token counts from OpenAI directly.
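For reference, this is roughly how a caller opts into usage reporting today. This is a hedged example, not code from this PR: the model name and prompt are placeholders, and the usage attribute layout assumes a recent openai client version that supports stream_options.

```python
# Hedged example of the user-facing kwarg (not part of this PR's code).
# Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries token counts
)
for chunk in stream:
    if chunk.usage is not None:  # only the trailing chunk has usage populated
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)
```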

Proposed design

There are 2 major components for this feature:

  1. If a user does not specify stream_options: {"include_usage": True} as a kwarg on the chat/completions call, we need to manually insert that as part of the kwargs before the request is made.
  2. If a user does not specify stream_options: {"include_usage": True} as a kwarg on the chat/completions call but we add that option on the integration-side, the returned streamed response will include an additional chunk (with empty content) at the end containing token usage information. To avoid disrupting user applications with one more chunk (with different content/fields) than expected, the integration should automatically extract the last chunk under the hood.

Note: if a user explicitly specifies stream_options: {"include_usage": False}, then we must respect their intent and avoid overriding that kwarg. We'll note in the release note that we cannot guarantee 100% accurate token counts in this case. A rough sketch of both behaviors is shown below.
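As a rough sketch only: the helper names, attribute access, and span metric names below are illustrative assumptions, not the integration's actual implementation.

```python
# Illustrative sketch -- helper names, attribute access, and metric names
# are assumptions, not the integration's actual code.

def _maybe_inject_usage(kwargs):
    """Inject stream_options={"include_usage": True} unless the user set include_usage themselves."""
    if not kwargs.get("stream"):
        return False  # non-streamed requests already report usage
    stream_options = kwargs.get("stream_options") or {}
    if "include_usage" in stream_options:
        return False  # respect an explicit True/False from the user
    kwargs["stream_options"] = {**stream_options, "include_usage": True}
    return True  # we injected it, so we must also swallow the extra usage chunk


def _filter_injected_usage_chunk(chunk, injected, span):
    """Record the trailing usage chunk on the span and hide it if we injected include_usage."""
    usage = getattr(chunk, "usage", None)
    if injected and usage is not None and not chunk.choices:
        span.set_metric("openai.response.usage.prompt_tokens", usage.prompt_tokens)
        span.set_metric("openai.response.usage.completion_tokens", usage.completion_tokens)
        return None  # do not yield this chunk back to the application
    return chunk
```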

Streamed reading logic change

Additionally, we change the __iter__/__aiter__ methods of our traced streamed responses. Previously we returned the traced streamed response and relied on the underlying __next__/__anext__ methods, but to ensure spans are finished even if the streamed response is not fully consumed, the __iter__/__aiter__ methods now implement the stream consumption themselves using try/except/finally (a minimal sketch follows the notes below).

Notes:

  1. This only applies when users iterate via __iter__()/__aiter__(); directly calling __next__()/__anext__() individually will not let us know when the overall response is fully consumed.
  2. When users use __aiter__() and break early, they are still responsible for calling resp.close(), since asynchronous generators do not automatically close when the context manager is exited (the generator is held open until close() is called, either manually or by the garbage collector).
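A minimal sketch of the synchronous case, assuming the traced wrapper keeps the underlying stream and the APM span on self._stream / self._span; these attribute names are illustrative, not the integration's actual ones.

```python
# Minimal sketch, not the integration's actual class: attribute names are assumptions.
class TracedStream:
    def __init__(self, stream, span):
        self._stream = stream  # the underlying openai stream object
        self._span = span      # the APM span tracking this request

    def __iter__(self):
        try:
            for chunk in self._stream:
                yield chunk
        except Exception as exc:
            # Record the error on the span before re-raising to the caller.
            self._span.set_exc_info(type(exc), exc, exc.__traceback__)
            raise
        finally:
            # Runs whether the stream was fully consumed, abandoned early
            # (GeneratorExit), or raised, so the span is always finished.
            self._span.finish()
```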

Testing

This PR simplifies the existing OpenAI streamed completion/chat completion tests (using snapshots where possible instead of large numbers of tedious assertions) and adds coverage for the token-extraction behavior: existing tests drop the include_usage: True option to assert that automatic extraction works, and a couple of new tests assert our original behavior when include_usage: False is explicitly set.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Contributor

github-actions bot commented Dec 2, 2024

CODEOWNERS have been resolved as:

releasenotes/notes/feat-openai-streamed-chunk-auto-extract-4cbaea8870b1df13.yaml  @DataDog/apm-python
tests/snapshots/tests.contrib.openai.test_openai.test_chat_completion_stream.json  @DataDog/apm-python
tests/snapshots/tests.contrib.openai.test_openai.test_completion_stream.json  @DataDog/apm-python
tests/snapshots/tests.contrib.openai.test_openai_v1.test_completion_stream_est_tokens.json  @DataDog/apm-python
ddtrace/contrib/internal/openai/_endpoint_hooks.py                      @DataDog/ml-observability
ddtrace/contrib/internal/openai/utils.py                                @DataDog/ml-observability
tests/contrib/openai/test_openai_llmobs.py                              @DataDog/ml-observability
tests/contrib/openai/test_openai_v1.py                                  @DataDog/ml-observability

@pr-commenter

pr-commenter bot commented Dec 2, 2024

Benchmarks

Benchmark execution time: 2025-01-08 22:33:48

Comparing candidate commit 2da832f in PR branch yunkim/openai-streamed-token-magic with baseline commit 75bed24 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

@Yun-Kim Yun-Kim force-pushed the yunkim/openai-streamed-token-magic branch from b2fff76 to 480f3d0 Compare December 31, 2024 02:59
@Yun-Kim Yun-Kim force-pushed the yunkim/openai-streamed-token-magic branch from 480f3d0 to 92f0707 Compare December 31, 2024 03:03
@datadog-dd-trace-py-rkomorn

datadog-dd-trace-py-rkomorn bot commented Dec 31, 2024

Datadog Report

Branch report: yunkim/openai-streamed-token-magic
Commit report: 2da832f
Test service: dd-trace-py

✅ 0 Failed, 87 Passed, 1468 Skipped, 4m 10.11s Total duration (35m 16.92s time saved)

@Yun-Kim Yun-Kim marked this pull request as ready for review December 31, 2024 17:06
@Yun-Kim Yun-Kim requested review from a team as code owners December 31, 2024 17:06
@Yun-Kim Yun-Kim force-pushed the yunkim/openai-streamed-token-magic branch from f8c5e1e to 5531b92 Compare January 6, 2025 21:26
Contributor

@ncybul ncybul left a comment


This LGTM. I tested it out with a simple dummy script, and everything looks to be working as expected (e.g. we return the usage chunk when the user requests it, but extract it under the hood when they leave out that option).

Just curious, why do we want to avoid requesting the usage information from OpenAI if the user explicitly sets include_usage to False? If we do the extraction under the hood, isn't this a benign change on our end?

@Yun-Kim
Contributor Author

Yun-Kim commented Jan 7, 2025

Just curious, why do we want to avoid requesting the usage information from OpenAI if the user explicitly sets include_usage to False? If we do the extraction under the hood, isn't this a benign change on our end?

You're not wrong that this is strictly a benign change on our part, but the overriding principle here is to minimize disrupting user applications where possible. If a user decides to explicitly reject token usage from OpenAI, then regardless of how strange that use case is (also likely a very rare use case IMO), we should respect that - it's one thing for us to infer/add kwargs, but it's another for us to set a user-provided kwarg to the opposite value specified by the user.

This way, we'll be able to keep most people happy, just with the explicit caveat that users in this specific case shouldn't expect 100% accurate token counts. Does that make sense @ncybul?

Member

@Kyle-Verhoog Kyle-Verhoog left a comment


just one question, otherwise good to me!

Review thread on ddtrace/contrib/internal/openai/utils.py (resolved)
Member

@Kyle-Verhoog Kyle-Verhoog left a comment


well done 👏

@Yun-Kim Yun-Kim enabled auto-merge (squash) January 8, 2025 21:50
@Yun-Kim Yun-Kim disabled auto-merge January 8, 2025 22:01
@Yun-Kim Yun-Kim merged commit 35fe7b5 into main Jan 9, 2025
579 checks passed
@Yun-Kim Yun-Kim deleted the yunkim/openai-streamed-token-magic branch January 9, 2025 15:35