Performance Metrics #38
Conversation
             token_budget=None,
             new_info_threshold=0.7,
             convo=False,
             timeit=False):
Please add this to the parameters (input args) docstrings. Please also make sure there are line breaks before Parameters and Returns (looks like those line breaks got removed here, not sure why).
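For example, something along these lines (just a sketch; the summary line and the other parameter descriptions are placeholders, only the timeit entry is copied from this diff):

def engineer_query(self, query, token_budget=None, new_info_threshold=0.7,
                   convo=False, timeit=False):
    """(one-line summary)

    Parameters
    ----------
    query : str
        ...
    timeit : bool
        Flag to return the performance metrics on API calls.

    Returns
    -------
    ...
    """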
@@ -87,6 +90,9 @@ def engineer_query(self, query, token_budget=None, new_info_threshold=0.7,
        references : list
            The list of references (strs) used in the engineered prompt is
            returned here
        vector_query_time : float
I know you didn't do this, but can you add a docstring for used_index here? Not sure why we don't have one.
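It could be as small as an entry like this under Returns (the type and description below are guesses and would need to be checked against what the method actually returns):

        used_index : np.ndarray
            Indices of the text corpus that were used to build the
            engineered prompt (placeholder description, please verify).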
@@ -184,7 +192,8 @@ def chat(self, query,
            valid ref_col.
        return_chat_obj : bool
            Flag to only return the ChatCompletion from OpenAI API.
        timeit : bool
            Flag to return the performance metrics on API calls.

        Returns
Same general comments about docstrings here.
        total_chat_time = end_time - start_chat_time
        performance = {
            "total_chat_time": total_chat_time,
            "chat_completion_time": chat_completion_time,
Am I understanding correctly that the chat completion time is the time to get the first response from the LLM? That's great and important, but if the LLM is in "stream" mode, does this actually work, or does the API return something immediately and only start responding in the for chunk in response generator loop? If that's the case, maybe we should calculate the end time at the first chunk generated from the response object.
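Roughly what I have in mind, as a sketch (this assumes the newer openai client interface; client, messages, and the model name are placeholders, not ELM's actual code):

import time

start_chat_time = time.time()
response = client.chat.completions.create(model="gpt-4",
                                           messages=messages,
                                           stream=True)

chat_completion_time = None
content = ""
for chunk in response:
    if chat_completion_time is None:
        # In stream mode, create() returns almost immediately; the first
        # chunk is when the model actually starts responding, so stop the
        # completion-time clock here.
        chat_completion_time = time.time() - start_chat_time
    delta = chunk.choices[0].delta.content
    if delta:
        content += delta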
@@ -152,10 +160,10 @@ def chat(self, query,
             token_budget=None,
             new_info_threshold=0.7,
             print_references=False,
-            return_chat_obj=False):
+            return_chat_obj=False,
+            timeit=False):
Why don't we just always have timeit=True? I think we should just remove this as a kwarg and always return the performance metrics. This will break anything that expects a certain number of outputs, but the compute is basically free and it's useful information.
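i.e., roughly this shape (a toy sketch, not the real chat() implementation; ELM's actual signature and return values would need to be updated accordingly):

import time


def chat(query, return_chat_obj=False):
    """Toy stand-in for chat() that always returns performance metrics."""
    start_chat_time = time.time()
    response_message = f"echo: {query}"  # placeholder for the real LLM call
    chat_completion_time = time.time() - start_chat_time
    end_time = time.time()

    # No timeit kwarg: the performance dict is always part of the output.
    performance = {
        "total_chat_time": end_time - start_chat_time,
        "chat_completion_time": chat_completion_time,
    }
    return response_message, performance


# Existing callers that expect a single output would need to unpack:
message, performance = chat("How long did that query take?")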
This PR adds time-tracking capability to ELM so that we can measure how long queries to the database take, how long chat completions take, and how long a call to the chat function takes overall.