Performance Metrics #38

Open · wants to merge 9 commits into main
Conversation

jeisenman23 (Collaborator):

This PR adds time-tracking capability within ELM so that we can measure how long database queries take, how long chat completions take, and how long the chat function takes to run end to end.
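
For readers skimming the diff, the pattern is roughly the following (a minimal sketch, not ELM's actual implementation; `query_vector_db` and `run_chat_completion` are illustrative stand-ins, and the dictionary keys mirror the ones added in this PR):

```python
import time

def query_vector_db(query):
    """Stand-in for ELM's vector database query."""
    return ["ref1", "ref2"]

def run_chat_completion(query, refs):
    """Stand-in for the OpenAI chat completion call."""
    return "answer"

def chat(query, timeit=False):
    start_chat_time = time.perf_counter()

    t0 = time.perf_counter()
    refs = query_vector_db(query)
    vector_query_time = time.perf_counter() - t0  # database query duration

    t0 = time.perf_counter()
    response = run_chat_completion(query, refs)
    chat_completion_time = time.perf_counter() - t0  # LLM call duration

    total_chat_time = time.perf_counter() - start_chat_time
    performance = {"total_chat_time": total_chat_time,
                   "chat_completion_time": chat_completion_time,
                   "vector_query_time": vector_query_time}
    return (response, performance) if timeit else response
```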

grantbuster self-requested a review on November 15, 2024, 20:20.
```python
             token_budget=None,
             new_info_threshold=0.7,
             convo=False,
             timeit=False):
```
grantbuster (Member):

Please add this to the parameters (input args) docstrings. Also, please make sure there are line breaks before Parameters and Returns (it looks like those line breaks got removed here, not sure why).
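
For example, a minimal numpydoc-style layout with the blank lines in question (a sketch only; the real docstrings have more entries):

```python
def chat(self, query, timeit=False):
    """Answer a query with the LLM.

    Parameters
    ----------
    query : str
        The question to ask the LLM.
    timeit : bool
        Flag to return the performance metrics on API calls.

    Returns
    -------
    response : str
        The LLM response.
    """
```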

```diff
@@ -87,6 +90,9 @@ def engineer_query(self, query, token_budget=None, new_info_threshold=0.7,
         references : list
             The list of references (strs) used in the engineered prompt is
             returned here
+        vector_query_time : float
```
grantbuster (Member):

I know you didn't do this, but can you add a docstring for used_index here? Not sure why we don't have one.
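
Something along these lines in the Returns section would do (the type and description here are guesses and should be matched to what used_index actually holds):

```
# NOTE: guessed type/description; verify against engineer_query's return value
used_index : bool
    Flag indicating whether the vector database index was used when
    engineering the query.
```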

```diff
@@ -184,7 +192,8 @@ def chat(self, query,
             valid ref_col.
         return_chat_obj : bool
             Flag to only return the ChatCompletion from OpenAI API.
-
+        timeit : bool
+            Flag to return the performance metrics on API calls.
         Returns
```
grantbuster (Member):

Same general comments about docstrings here.

```python
total_chat_time = end_time - start_chat_time  # wall time for the full chat() call
performance = {
    "total_chat_time": total_chat_time,
    "chat_completion_time": chat_completion_time,
```
grantbuster (Member):

Am I understanding correctly that the chat completion time is the time to get the first response from the LLM? That's great and important, but if the LLM is in "stream" mode, does this actually work, or does the API return something immediately and only start responding in the `for chunk in response` generator loop? If that's the case, maybe we should calculate the end time at the first chunk generation in the response object.
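
For reference, this is how time-to-first-chunk can be measured with a streaming response (a sketch assuming the `openai>=1.0` client, not ELM's actual code; the model name and variable names are illustrative):

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
# create() returns almost immediately in stream mode; the model only
# starts producing text as chunks are pulled from the generator.
time_to_first_chunk = None
parts = []
for chunk in stream:
    if time_to_first_chunk is None:
        time_to_first_chunk = time.perf_counter() - start
    if chunk.choices and chunk.choices[0].delta.content:
        parts.append(chunk.choices[0].delta.content)
total_stream_time = time.perf_counter() - start
```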

```diff
@@ -152,10 +160,10 @@ def chat(self, query,
              token_budget=None,
              new_info_threshold=0.7,
              print_references=False,
-             return_chat_obj=False):
+             return_chat_obj=False,
+             timeit=False):
```
grantbuster (Member):

Why don't we just always have timeit=True? I think we should remove this kwarg and always return performance metrics. This will break anything that expects a certain number of outputs, but the compute is basically free and it's useful information.
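
i.e., roughly this change (placeholder bodies, not the real implementation):

```python
# Current PR: metrics are opt-in, so the output shape depends on the flag.
def chat(query, timeit=False):
    response, performance = "answer", {"total_chat_time": 0.12}  # placeholders
    if timeit:
        return response, performance
    return response

# Suggestion: drop the kwarg and always return the metrics.
def chat(query):
    response, performance = "answer", {"total_chat_time": 0.12}  # placeholders
    return response, performance
```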
