Merge from main. #27

Merged 62 commits on Jan 14, 2024
Commits
f7860ff
Wrap up explanable metrics
penguine-ip Dec 25, 2023
860564d
fix tests
penguine-ip Dec 26, 2023
4ea05b1
Optimized faithfulness reasoning
penguine-ip Dec 26, 2023
e79a90f
.
penguine-ip Dec 26, 2023
db5879f
Merge pull request #387 from confident-ai/hotfix/explanablemetrics
penguine-ip Dec 26, 2023
9835fbe
Delay transformers import
penguine-ip Dec 26, 2023
01c3b59
Merge pull request #388 from confident-ai/hotfix/seperatetransoformer…
penguine-ip Dec 26, 2023
448431f
Update README.md
penguine-ip Dec 26, 2023
0c75412
Fix azure commands
penguine-ip Dec 26, 2023
36bd7a9
Updated azure docs
penguine-ip Dec 26, 2023
668c8d0
Merge pull request #389 from confident-ai/hotfix/azurecommands
penguine-ip Dec 26, 2023
ab16dc3
new release
penguine-ip Dec 26, 2023
bb7e7aa
Merge pull request #390 from confident-ai/release-v0.20.43
penguine-ip Dec 26, 2023
86853ff
Updated docs
penguine-ip Dec 26, 2023
8429e27
Updated docs
penguine-ip Dec 26, 2023
9c6040e
Updated docs
penguine-ip Dec 27, 2023
a37b541
.
penguine-ip Dec 27, 2023
87aa422
Updated docs
penguine-ip Dec 27, 2023
97076e1
Fix docs
penguine-ip Dec 27, 2023
a883f02
Added threading to track
penguine-ip Dec 28, 2023
1cf39f4
rename
penguine-ip Dec 28, 2023
5e3f12a
Merge pull request #391 from confident-ai/feature/asynctracking
penguine-ip Dec 28, 2023
2a05146
LLamaindex tracing
penguine-ip Dec 28, 2023
25bd4c9
Added llama
penguine-ip Dec 28, 2023
8a5e382
llamaindex tracing
penguine-ip Jan 1, 2024
903352a
added dependency
penguine-ip Jan 1, 2024
3396a03
Merge pull request #392 from confident-ai/feature/llamaintegration
penguine-ip Jan 2, 2024
03d767f
updated docs
penguine-ip Jan 2, 2024
4069536
updated docs
penguine-ip Jan 3, 2024
6eb34e2
Sentry counter
penguine-ip Jan 3, 2024
5c7a40a
Make threshold dynamic
penguine-ip Jan 3, 2024
a93ce75
Merge pull request #394 from confident-ai/feature/sentry-counter
penguine-ip Jan 3, 2024
2613938
Merge pull request #395 from confident-ai/hotfix/hardcoded-threshold
penguine-ip Jan 3, 2024
8689b7b
Update README.md
penguine-ip Jan 3, 2024
7ec7044
new release
penguine-ip Jan 3, 2024
5497a18
added progress loading
penguine-ip Jan 3, 2024
7ea687e
Remove import
penguine-ip Jan 3, 2024
972cd1d
Merge pull request #396 from confident-ai/hotfix/progressloading
penguine-ip Jan 3, 2024
b916de1
Merge pull request #398 from confident-ai/release-v0.20.44
penguine-ip Jan 3, 2024
f476ea6
updated docs
penguine-ip Jan 3, 2024
6fad5cb
Add maximum score base metric
penguine-ip Jan 10, 2024
f0d6d3e
reformat
penguine-ip Jan 10, 2024
28d1072
Fix langchain azure
penguine-ip Jan 10, 2024
8d8c291
Fix docs
penguine-ip Jan 10, 2024
af29c18
Migrate minimum score to threshold
penguine-ip Jan 11, 2024
1e5fbab
Fix langchain chat models
penguine-ip Jan 11, 2024
316ad41
Merge pull request #400 from confident-ai/features/latency-and-cost
penguine-ip Jan 11, 2024
be2c541
Merge pull request #401 from confident-ai/hotfix/langchain-azure
penguine-ip Jan 11, 2024
43675ae
Added threshold
penguine-ip Jan 12, 2024
3cbcb0c
Merge pull request #402 from confident-ai/feature/threshold-confident
penguine-ip Jan 12, 2024
a4d8023
new release
penguine-ip Jan 12, 2024
6aebc24
Merge pull request #403 from confident-ai/release-v0.20.45
penguine-ip Jan 12, 2024
c6a7a55
Updated docs
penguine-ip Jan 12, 2024
02559e2
Fix display logic
penguine-ip Jan 12, 2024
877842f
.
penguine-ip Jan 12, 2024
b1115a1
.
penguine-ip Jan 12, 2024
f81f82b
Merge pull request #404 from confident-ai/hotfix/threshold-logic
penguine-ip Jan 12, 2024
dcdc07e
.
penguine-ip Jan 12, 2024
6e63b98
Updated docs
penguine-ip Jan 12, 2024
7b8e1b4
new release
penguine-ip Jan 12, 2024
ed715bd
Merge pull request #405 from confident-ai/release-v0.20.46
penguine-ip Jan 12, 2024
6b4e6e9
Update README.md
penguine-ip Jan 12, 2024
Binary file modified .DS_Store
Binary file not shown.
46 changes: 37 additions & 9 deletions README.md
@@ -18,24 +18,28 @@
</a>
</p>

**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models **locally on your machine**.
**DeepEval** is a simple-to-use, open-source LLM evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models that run **locally on your machine**.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

<br />

# Features

- Large variety of ready-to-use evaluation metrics powered by LLMs, statistical methods, or NLP models that run **locally on your machine**:
- Large variety of ready-to-use LLM evaluation metrics powered by LLMs (all with explanations), statistical methods, or NLP models that run **locally on your machine**:
- Hallucination
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- G-Eval
- Toxicity
- Bias
- etc.
- Evaluate your entire dataset in bulk in under 20 lines of Python code **in parallel**. Do this via the CLI in a Pytest-like manner, or through our `evaluate()` function.
- Easily create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class.
- Evaluate your entire dataset in bulk using fewer than 20 lines of Python code **in parallel**.
- [Automatically integrated with Confident AI](https://app.confident-ai.com) for continuous evaluation throughout the lifetime of your LLM (app):
  - log evaluation results and analyze metrics pass / fails
  - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
@@ -90,7 +94,7 @@ def test_case():

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."
hallucination_metric = HallucinationMetric(minimum_score=0.7)
hallucination_metric = HallucinationMetric(threshold=0.7)
test_case = LLMTestCase(input=input, actual_output=actual_output, context=context)
assert_test(test_case, [hallucination_metric])
```
@@ -104,13 +108,36 @@ deepeval test run test_chatbot.py
**Your test should have passed ✅** Let's break down what happened.

- The variable `input` mimics user input, and `actual_output` is a placeholder for your chatbot's intended output based on this query.
- The variable `context` contains the relevant information from your knowledge base, and `HallucinationMetric(minimum_score=0.7)` is an out-of-the-box metric provided by DeepEval. It helps you evaluate the factual accuracy of your chatbot's output based on the provided context.
- The metric score ranges from 0 - 1. The `minimum_score=0.7` threshold ultimately determines whether your test has passed or not.
- The variable `context` contains the relevant information from your knowledge base, and `HallucinationMetric(threshold=0.7)` is an out-of-the-box metric provided by DeepEval. It helps you evaluate the factual accuracy of your chatbot's output based on the provided context.
- The metric score ranges from 0 - 1. The `threshold=0.7` value ultimately determines whether your test has passed or not.

[Read our documentation](https://docs.confident-ai.com/docs/getting-started) for more information on how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.
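
For readers who want a concrete picture of the "create your own custom metrics" point above, here is a minimal, hypothetical sketch of a metric that inherits DeepEval's base metric class. The `BaseMetric` import path and the `measure` / `is_successful` hooks follow common DeepEval usage but are assumptions, not something confirmed by this PR's diff:

```python
from deepeval.metrics import BaseMetric  # assumed import path for the base metric class
from deepeval.test_case import LLMTestCase


class LengthMetric(BaseMetric):
    """Hypothetical metric: passes when the actual output is at least `threshold` characters long."""

    def __init__(self, threshold: int = 10):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Score the test case by the length of its actual output.
        self.score = len(test_case.actual_output or "")
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"
```

A metric defined this way could then be passed to `assert_test` or `evaluate` alongside the built-in metrics shown in the examples above.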

<br />

## Evaluating Without Pytest Integration

Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.

```python
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input = "What if these shoes don't fit?"
context = ["All customers are eligible for a 30 day full refund at no extra costs."]
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."

hallucination_metric = HallucinationMetric(threshold=0.7)
test_case = LLMTestCase(
input=input,
actual_output=actual_output,
context=context
)
evaluate([test_case], [hallucination_metric])
```

## Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate things in bulk:
Expand All @@ -132,8 +159,8 @@ dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
hallucination_metric = HallucinationMetric(minimum_score=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
hallucination_metric = HallucinationMetric(threshold=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```

Expand All @@ -144,7 +171,7 @@ deepeval test run test_<filename>.py -n 4

<br/>

Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using pytest:
Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using our Pytest integration:

```python
from deepeval import evaluate
Expand All @@ -164,6 +191,7 @@ We offer a [free web platform](https://app.confident-ai.com) for you to:
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation in production.
6. Track events in production and view live evaluation results over time.
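
As a rough sketch of what the production event tracking in points 5-6 could look like in code: this PR's diff does import `track` from `deepeval.event` in `deepeval/__init__.py`, but the parameter names below are illustrative assumptions, not an API confirmed by this page:

```python
import deepeval

# Hypothetical call -- parameter names are assumptions; consult the Confident AI docs
# for the actual signature of deepeval.track().
deepeval.track(
    event_name="chatbot-response",  # assumed label for this production event
    model="gpt-4",                  # assumed model identifier
    input="What if these shoes don't fit?",
    response="We offer a 30-day full refund at no extra costs.",
)
```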

Everything on Confident AI, including how to use Confident, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction).

2 changes: 1 addition & 1 deletion deepeval/__init__.py
@@ -2,13 +2,13 @@
import re

# Optionally add telemetry
from .telemetry import *
from ._version import __version__

from .decorators.hyperparameters import set_hyperparameters
from deepeval.event import track
from deepeval.evaluate import evaluate, run_test, assert_test
from deepeval.test_run import on_test_run_end
from deepeval.telemetry import *

__all__ = [
"set_hyperparameters",
2 changes: 1 addition & 1 deletion deepeval/_version.py
@@ -1 +1 @@
__version__: str = "0.20.42"
__version__: str = "0.20.46"
80 changes: 79 additions & 1 deletion deepeval/api.py
@@ -4,9 +4,13 @@
import requests
import warnings
from requests.adapters import HTTPAdapter, Response, Retry
import aiohttp
from aiohttp import ClientSession
from requests.adapters import HTTPAdapter
from enum import Enum

from deepeval.constants import API_KEY_ENV
from deepeval.key_handler import KEY_FILE_HANDLER, KeyValues
from enum import Enum

API_BASE_URL = "https://app.confident-ai.com/api"

@@ -259,3 +263,77 @@ def quote_string(text: str) -> str:
str: Quoted text in return
"""
return urllib.parse.quote(text, safe="")

async def _api_request_async(
self,
method,
endpoint,
headers=None,
auth=None,
params=None,
body=None,
files=None,
data=None,
):
"""Generic asynchronous HTTP request method with error handling."""
url = f"{self.base_api_url}/{endpoint}"
async with ClientSession() as session:
try:
# Preparing the request body for file uploads if files are present
if files:
data = aiohttp.FormData()
for file_name, file_content in files.items():
data.add_field(
file_name, file_content, filename=file_name
)

# Sending the request
res = await session.request(
method=method,
url=url,
headers=headers,
params=params,
json=body,
)

# Check response status
if res.status == 200:
try:
json = await res.json()
return json
except ValueError:
return await res.text()
else:
# Determine how to process the response based on Content-Type
content_type = res.headers.get("Content-Type", "")
if "application/json" in content_type:
error_response = await res.json()
else:
error_response = await res.text()

# Specifically handle status code 400
if res.status == 400:
print(f"Error 400: Bad Request - {error_response}")

print(f"Error {res.status}: {error_response}")
return None

except Exception as err:
raise Exception(f"HTTP request failed: {err}") from err

async def post_request_async(
self, endpoint, body=None, files=None, data=None
):
"""Generic asynchronous POST Request Wrapper"""
print("hi")
return await self._api_request_async(
"POST",
endpoint,
headers=self._headers
if files is None
else self._headers_multipart_form_data,
auth=self._auth,
body=body,
files=files,
data=data,
)
60 changes: 44 additions & 16 deletions deepeval/chat_completion/retry.py
@@ -1,19 +1,47 @@
from typing import Callable, Any
import random
import time
import openai


def call_openai_with_retry(
callable: Callable[[], Any], max_retries: int = 2
) -> Any:
for _ in range(max_retries):
try:
response = callable()
return response
except Exception as e:
print(f"An error occurred: {e}. Retrying...")
time.sleep(2)
continue

raise Exception(
"Max retries reached. Unable to make a successful API call to OpenAI."
)
def retry_with_exponential_backoff(
func,
initial_delay: float = 1,
exponential_base: float = 2,
jitter: bool = True,
max_retries: int = 10,
errors: tuple = (openai.RateLimitError,),
):
"""Retry a function with exponential backoff."""

def wrapper(*args, **kwargs):
# Initialize variables
num_retries = 0
delay = initial_delay

# Loop until a successful response or max_retries is hit or an exception is raised
while True:
try:
return func(*args, **kwargs)

# Retry on specified errors
except errors as e:
# Increment retries
num_retries += 1

# Check if max retries has been reached
if num_retries > max_retries:
raise Exception(
f"Maximum number of retries ({max_retries}) exceeded."
)

# Increment the delay
delay *= exponential_base * (1 + jitter * random.random())

# Sleep for the delay
time.sleep(delay)

# Raise exceptions for any errors not specified
except Exception as e:
raise e

return wrapper
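
For context, a brief usage sketch of the `retry_with_exponential_backoff` decorator added above. The decorated OpenAI call and client setup are illustrative assumptions and are not part of this diff:

```python
import openai

from deepeval.chat_completion.retry import retry_with_exponential_backoff


@retry_with_exponential_backoff
def chat_completion(**kwargs):
    # Any openai.RateLimitError raised here is caught by the wrapper and retried
    # with exponentially increasing, jittered delays (up to max_retries attempts).
    client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    return client.chat.completions.create(**kwargs)


response = chat_completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
```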
66 changes: 0 additions & 66 deletions deepeval/cli/azure_openai.py

This file was deleted.
