Merge pull request #27 from confident-ai/main
Merge from main.
Anindyadeep authored Jan 14, 2024
2 parents 2109a70 + 6b4e6e9 commit 6e56531
Showing 74 changed files with 1,682 additions and 829 deletions.
Binary file modified .DS_Store
Binary file not shown.
46 changes: 37 additions & 9 deletions README.md
@@ -18,24 +18,28 @@
</a>
</p>

**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models **locally on your machine**.
**DeepEval** is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models that run **locally on your machine**.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

<br />

# Features

- Large variety of ready-to-use evaluation metrics powered by LLMs, statistical methods, or NLP models that runs **locally on your machine**:
- Large variety of ready-to-use LLM evaluation metrics powered by LLMs (all with explanations), statistical methods, or NLP models that run **locally on your machine**:
- Hallucination
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- G-Eval
- Toxicity
- Bias
- etc.
- Evaluate your entire dataset in bulk in under 20 lines of Python code **in parallel**. Do this via the CLI in a Pytest-like manner, or through our `evaluate()` function.
- Easily create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch below this list).
- Evaluate your entire dataset in bulk using fewer than 20 lines of Python code **in parallel**.
- [Automatically integrated with Confident AI](https://app.confident-ai.com) for continuous evaluation throughout the lifetime of your LLM (app):
- log evaluation results and analyze which metrics pass / fail
- compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
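
Below is a minimal sketch of what such a custom metric could look like. The `BaseMetric` import path and the `measure` / `is_successful` / `threshold` hooks follow DeepEval's documented pattern; the `LatencyMetric` class and its `latency` field are invented here purely for illustration and are not part of this change.

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class LatencyMetric(BaseMetric):
    """Hypothetical metric: passes when the app responded quickly enough."""

    def __init__(self, max_seconds: float = 10.0):
        self.threshold = max_seconds

    def measure(self, test_case: LLMTestCase) -> float:
        # Assumes the caller attached a `latency` attribute to the test case.
        self.score = getattr(test_case, "latency", 0.0)
        self.success = self.score <= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Latency"
```

An instance of such a metric can then be passed to `assert_test` or `evaluate()` alongside the built-in metrics.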
@@ -90,7 +94,7 @@ def test_case():

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."
hallucination_metric = HallucinationMetric(minimum_score=0.7)
hallucination_metric = HallucinationMetric(threshold=0.7)
test_case = LLMTestCase(input=input, actual_output=actual_output, context=context)
assert_test(test_case, [hallucination_metric])
```
@@ -104,13 +108,36 @@ deepeval test run test_chatbot.py
**Your test should have passed ✅** Let's break down what happened.

- The variable `input` mimics user input, and `actual_output` is a placeholder for your chatbot's intended output based on this query.
- The variable `context` contains the relevant information from your knowledge base, and `HallucinationMetric(minimum_score=0.7)` is an out-of-the-box metric provided by DeepEval. It helps you evaluate the factual accuracy of your chatbot's output based on the provided context.
- The metric score ranges from 0 - 1. The `minimum_score=0.7` threshold ultimately determines whether your test has passed or not.
- The variable `context` contains the relevant information from your knowledge base, and `HallucinationMetric(threshold=0.7)` is an out-of-the-box metric provided by DeepEval. It helps you evaluate the factual accuracy of your chatbot's output based on the provided context.
- The metric score ranges from 0 to 1. The `threshold=0.7` value ultimately determines whether your test has passed or not.
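
If you want to inspect the raw score rather than assert on it, metrics can typically also be run standalone. This is a hedged sketch, assuming the metric exposes `measure()`, `score`, and `is_successful()` as in DeepEval's documentation:

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    context=["All customers are eligible for a 30 day full refund at no extra costs."],
)

metric = HallucinationMetric(threshold=0.7)
metric.measure(test_case)      # populates metric.score with a value between 0 and 1
print(metric.score)            # the raw metric score
print(metric.is_successful())  # pass / fail relative to the 0.7 threshold
```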

[Read our documentation](https://docs.confident-ai.com/docs/getting-started) for more information on how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.

<br />

## Evaluating Without Pytest Integration

Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.

```python
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input = "What if these shoes don't fit?"
context = ["All customers are eligible for a 30 day full refund at no extra costs."]
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."

hallucination_metric = HallucinationMetric(threshold=0.7)
test_case = LLMTestCase(
input=input,
actual_output=actual_output,
context=context
)
evaluate([test_case], [hallucination_metric])
```

## Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate things in bulk:
@@ -132,8 +159,8 @@ dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(minimum_score=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```
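
The hunk above only shows part of the example. A self-contained sketch of the same pattern might look like the following; the two test cases and their contents are placeholders invented for illustration, and the import paths follow DeepEval's documented usage for this release:

```python
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# Placeholder test cases -- replace with real inputs and your LLM's actual outputs.
first_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    context=["All customers are eligible for a 30 day full refund at no extra costs."],
)
second_test_case = LLMTestCase(
    input="Do you ship internationally?",
    actual_output="Yes, we ship to over 50 countries worldwide.",
    context=["We ship to more than 50 countries worldwide."],
)

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])


@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```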

@@ -144,7 +171,7 @@ deepeval test run test_<filename>.py -n 4

<br/>

Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using pytest:
Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using our Pytest integration:

```python
from deepeval import evaluate
@@ -164,6 +191,7 @@ We offer a [free web platform](https://app.confident-ai.com) for you to:
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation in production.
6. Track events in production and view live evaluation results over time.

Everything on Confident AI, including how to use Confident, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction).

2 changes: 1 addition & 1 deletion deepeval/__init__.py
@@ -2,13 +2,13 @@
import re

# Optionally add telemetry
from .telemetry import *
from ._version import __version__

from .decorators.hyperparameters import set_hyperparameters
from deepeval.event import track
from deepeval.evaluate import evaluate, run_test, assert_test
from deepeval.test_run import on_test_run_end
from deepeval.telemetry import *

__all__ = [
"set_hyperparameters",
2 changes: 1 addition & 1 deletion deepeval/_version.py
@@ -1 +1 @@
__version__: str = "0.20.42"
__version__: str = "0.20.46"
80 changes: 79 additions & 1 deletion deepeval/api.py
@@ -4,9 +4,13 @@
import requests
import warnings
from requests.adapters import HTTPAdapter, Response, Retry
import aiohttp
from aiohttp import ClientSession
from requests.adapters import HTTPAdapter
from enum import Enum

from deepeval.constants import API_KEY_ENV
from deepeval.key_handler import KEY_FILE_HANDLER, KeyValues
from enum import Enum

API_BASE_URL = "https://app.confident-ai.com/api"

@@ -259,3 +263,77 @@ def quote_string(text: str) -> str:
str: Quoted text in return
"""
return urllib.parse.quote(text, safe="")

async def _api_request_async(
    self,
    method,
    endpoint,
    headers=None,
    auth=None,
    params=None,
    body=None,
    files=None,
    data=None,
):
    """Generic asynchronous HTTP request method with error handling."""
    url = f"{self.base_api_url}/{endpoint}"
    async with ClientSession() as session:
        try:
            # Prepare a multipart body for file uploads if files are present
            if files:
                data = aiohttp.FormData()
                for file_name, file_content in files.items():
                    data.add_field(
                        file_name, file_content, filename=file_name
                    )

            # Send the request; aiohttp does not allow `json` and `data`
            # together, so the multipart form data takes precedence
            res = await session.request(
                method=method,
                url=url,
                headers=headers,
                params=params,
                json=body if not files else None,
                data=data if files else None,
            )

            # Check response status
            if res.status == 200:
                try:
                    return await res.json()
                except ValueError:
                    return await res.text()
            else:
                # Determine how to process the response based on Content-Type
                content_type = res.headers.get("Content-Type", "")
                if "application/json" in content_type:
                    error_response = await res.json()
                else:
                    error_response = await res.text()

                # Specifically handle status code 400
                if res.status == 400:
                    print(f"Error 400: Bad Request - {error_response}")

                print(f"Error {res.status}: {error_response}")
                return None

        except Exception as err:
            raise Exception(f"HTTP request failed: {err}") from err

async def post_request_async(
    self, endpoint, body=None, files=None, data=None
):
    """Generic asynchronous POST request wrapper."""
    return await self._api_request_async(
        "POST",
        endpoint,
        headers=self._headers
        if files is None
        else self._headers_multipart_form_data,
        auth=self._auth,
        body=body,
        files=files,
        data=data,
    )
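
For context, a hypothetical call site for the new async wrapper might look like the sketch below. The `Api` class name, its no-argument constructor, and the endpoint string are assumptions for illustration; only `post_request_async` itself appears in this diff.

```python
import asyncio

from deepeval.api import Api


async def main():
    api = Api()  # assumes an API key is already configured via env var or key file
    result = await api.post_request_async(
        endpoint="v1/test-run",        # hypothetical endpoint
        body={"name": "example run"},  # forwarded as the JSON payload
    )
    print(result)


asyncio.run(main())
```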
60 changes: 44 additions & 16 deletions deepeval/chat_completion/retry.py
@@ -1,19 +1,47 @@
from typing import Callable, Any
import random
import time
import openai


def call_openai_with_retry(
    callable: Callable[[], Any], max_retries: int = 2
) -> Any:
    for _ in range(max_retries):
        try:
            response = callable()
            return response
        except Exception as e:
            print(f"An error occurred: {e}. Retrying...")
            time.sleep(2)
            continue

    raise Exception(
        "Max retries reached. Unable to make a successful API call to OpenAI."
    )


def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.RateLimitError,),
):
    """Retry a function with exponential backoff."""

    def wrapper(*args, **kwargs):
        # Initialize variables
        num_retries = 0
        delay = initial_delay

        # Loop until a successful response, max_retries is hit, or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)

            # Retry on specified errors
            except errors as e:
                # Increment retries
                num_retries += 1

                # Check if max retries has been reached
                if num_retries > max_retries:
                    raise Exception(
                        f"Maximum number of retries ({max_retries}) exceeded."
                    )

                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())

                # Sleep for the delay
                time.sleep(delay)

            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e

    return wrapper
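
A hedged usage sketch for the new helper follows; the OpenAI client setup, model name, and prompt are illustrative assumptions, not part of this change.

```python
import openai

from deepeval.chat_completion.retry import retry_with_exponential_backoff

client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in the environment


@retry_with_exponential_backoff
def create_chat_completion(**kwargs):
    # Rate-limit errors raised here are retried with exponential backoff.
    return client.chat.completions.create(**kwargs)


response = create_chat_completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```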
66 changes: 0 additions & 66 deletions deepeval/cli/azure_openai.py

This file was deleted.

