feat(openai): add token usage stream options to request #11606
Conversation
Benchmarks: Benchmark execution time: 2025-01-08 22:33:48. Comparing candidate commit 2da832f in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.
Datadog Report: Branch report: ✅ 0 Failed, 87 Passed, 1468 Skipped, 4m 10.11s total duration (35m 16.92s time saved).
This LGTM. I tested it out with a simple dummy script. It looks like everything is working as expected (e.g. we return the usage chunk when the user requests it, but extract it out under the hood when they leave out that option).
Just curious, why do we want to avoid requesting the usage information from OpenAI if the user explicitly sets include_usage to False? If we do the extraction under the hood, isn't this a benign change on our end?
You're not wrong that this is strictly a benign change on our part, but the overriding principle here is to minimize disrupting user applications where possible. If a user decides to explicitly reject token usage from OpenAI, then regardless of how strange that use case is (also likely a very rare use case IMO), we should respect that - it's one thing for us to infer/add kwargs, but it's another for us to set a user-provided kwarg to the opposite value specified by the user. This way, we'll be able to keep most people happy, just with the explicit caveat that users in this specific case shouldn't expect 100% accurate token counts. Does that make sense @ncybul?
Just one question, otherwise looks good to me!
well done 👏
This PR adds special casing such that any user's OpenAI streamed chat/completion request will, unless explicitly specified otherwise, include token usage as part of the streamed response by default.
Motivation
OpenAI streamed responses have historically not provided token usage details as part of the streamed response. However, earlier this year OpenAI added a `stream_options: {"include_usage": True}` kwarg option to the chat/completions API, which provides token usage details in an additional stream chunk at the end of the streamed response. If this kwarg option is not specified by the user, token usage is not provided by OpenAI, and our current behavior is to make a best effort to 1) use the `tiktoken` library to calculate token counts, or 2) use a very crude heuristic to estimate token counts. Neither is ideal, as neither alternative takes function/tool calling into account. It is simpler and more accurate to just request the token counts from OpenAI directly.
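For reference, here is a minimal sketch (not part of this PR) of how a caller opts into usage reporting with the OpenAI Python client; the model name and messages are placeholders. With `include_usage` set, the final chunk carries the aggregate token counts and has no message content:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.usage is not None:
        # Final chunk: no content, only aggregate token counts.
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens, chunk.usage.total_tokens)
    elif chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
```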
Proposed design
There are two major components to this feature (a rough sketch follows this list):
1. If the user does not specify `stream_options: {"include_usage": True}` as a kwarg on the chat/completions call, we need to manually insert it into the kwargs before the request is made.
2. If the user does not specify `stream_options: {"include_usage": True}` as a kwarg on the chat/completions call but we add that option on the integration side, the returned streamed response will include an additional chunk (with empty content) at the end containing token usage information. To avoid disrupting user applications with one more chunk (with different content/fields) than expected, the integration should automatically extract that last chunk under the hood.

Note: if a user does explicitly specify `stream_options: {"include_usage": False}`, then we must respect their intent and avoid adding token usage to the kwargs. We'll note in the release note that we cannot guarantee 100% accurate token counts in this case.
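A rough sketch of how these two pieces could fit together; the function names, the `added_by_integration` flag, and the metric names are illustrative assumptions rather than the integration's actual code:

```python
def _inject_stream_options(kwargs):
    # Component 1: only add the option when the user has not expressed a preference.
    if kwargs.get("stream") and "stream_options" not in kwargs:
        kwargs["stream_options"] = {"include_usage": True}
        return True  # we added it, so the extra usage chunk must be hidden later
    return False


def _filter_usage_chunks(chunks, added_by_integration, span):
    # Component 2: record token usage on the span, but only swallow the extra
    # chunk if the integration (not the user) requested it.
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            # Metric names here are placeholders for whatever the integration records.
            span.set_metric("openai.response.usage.prompt_tokens", usage.prompt_tokens)
            span.set_metric("openai.response.usage.completion_tokens", usage.completion_tokens)
            if added_by_integration:
                continue  # hide the chunk the user did not ask for
        yield chunk
```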
Streamed reading logic change
Additionally, we make a change to the `__iter__`/`__aiter__` methods of our traced streamed responses. Previously we returned the traced streamed response (and relied on the underlying `__next__`/`__anext__` methods), but to ensure spans are finished even if the streamed response is not fully consumed, we change the `__iter__`/`__aiter__` methods to implement the stream consumption using a try/except/finally (see the sketch below).

Note: this only applies to `__iter__()`/`__aiter__()`, since directly calling `__next__()`/`__anext__()` individually will not let us know when the overall response is fully consumed.

Note: if users consume the stream via `__aiter__()` and break early, they are still responsible for calling `resp.close()`, since asynchronous generators do not automatically close when the context manager is exited (this is held until `close()` is called, either manually or by the garbage collector).
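A simplified sketch of what the reworked `__iter__` could look like (the async variant is analogous); the `TracedStream` class name is illustrative and the real tracer code differs in detail, but the key point is that the span is finished in `finally` regardless of how iteration ends:

```python
import sys


class TracedStream:
    """Illustrative wrapper around a streamed response; not the actual integration class."""

    def __init__(self, wrapped, span):
        self._wrapped = wrapped
        self._span = span

    def __iter__(self):
        try:
            for chunk in self._wrapped:
                yield chunk
        except Exception:
            # Record the error on the span before propagating it.
            self._span.set_exc_info(*sys.exc_info())
            raise
        finally:
            # Runs on normal exhaustion, on error, and when the caller breaks
            # early (the generator's close() raises GeneratorExit, which also
            # lands here), so the span is always finished.
            self._span.finish()
```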
Testing
This PR modifies the existing OpenAI streamed completion/chat completion tests to be simpler (using snapshots where possible instead of making large numbers of tedious assertions) and to add coverage for the token extraction behavior (existing tests remove the `include_usage: True` option to assert that the automatic extraction works, and we add a couple of tests asserting our original behavior when `include_usage: False` is explicitly set).
Checklist
Reviewer Checklist