Merge pull request #18 from confident-ai/main
merge from main.
Anindyadeep authored Dec 8, 2023
2 parents 68f277c + a285970 commit d3c1814
Showing 64 changed files with 2,137 additions and 1,196 deletions.
93 changes: 79 additions & 14 deletions README.md
@@ -18,22 +18,35 @@
</a>
</p>

**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as factual consistency, accuracy, answer relevancy, etc., using LLMs and various other NLP models. It's a production-ready alternative to RAGAS .
**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models **locally on your machine**.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

<br />

# Features

- Large variety of ready-to-use evaluation metrics, ranging from LLM evaluated (G-Eval) to metrics computed via statistical methods or NLP models.
- Large variety of ready-to-use evaluation metrics powered by LLMs, statistical methods, or NLP models that run **locally on your machine**:
  - Hallucination
  - Answer Relevancy
  - RAGAS
  - G-Eval
  - Toxicity
  - Bias
  - etc.
- Easily create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch after this list).
- Evaluate your entire dataset in bulk using fewer than 20 lines of Python code.
- [Integrated with Confident AI](https://confident-ai.com) for instant observability into evaluation results and hyperparameter comparisons (such as prompt templates and model version used).
- Evaluate your entire dataset in bulk, **in parallel**, using fewer than 20 lines of Python code.
- [Automatically integrated with Confident AI](https://app.confident-ai.com) for continuous evaluation throughout the lifetime of your LLM (app):
  - log evaluation results and analyze metric passes / fails
  - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
  - debug evaluation results via LLM traces
  - manage evaluation test cases / datasets in one place
  - track events to identify live LLM responses in production
  - add production events to existing evaluation datasets to strengthen evals over time
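
Here is a minimal sketch of the custom-metric bullet above. The class name `LengthMetric`, the `min_words` parameter, and the `measure` / `is_successful` / `__name__` members are illustrative assumptions about the base-class contract rather than the documented API:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


# Hypothetical custom metric: passes when the answer is "long enough".
# Names and attributes are assumptions for illustration only.
class LengthMetric(BaseMetric):
    def __init__(self, minimum_score: float = 0.5, min_words: int = 5):
        self.minimum_score = minimum_score
        self.min_words = min_words

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 if the output contains at least `min_words` words.
        word_count = len(test_case.actual_output.split())
        self.score = 1.0 if word_count >= self.min_words else 0.0
        self.success = self.score >= self.minimum_score
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"
```

Once defined, such a metric can be passed to `assert_test` or `evaluate` alongside the built-in ones.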

<br />

# Getting Started 🚀
# 🚀 Getting Started 🚀

Let's pretend your LLM application is a customer support chatbot; here's how DeepEval can help test what you've built.

@@ -43,9 +56,9 @@ Let's pretend your LLM application is a customer support chatbot; here's how Dee
pip install -U deepeval
```

## [Optional] Create an account
## Create an account (highly recommended)

Creating an account on our platform will allow you to log test results, enabling easy tracking of changes and performances over iterations. This step is optional, and you can run test cases even without logging in, but we highly recommend giving it a try.
Although optional, creating an account on our platform will allow you to log test results, enabling easy tracking of changes and performance over iterations. You can run test cases without logging in, but we highly recommend giving it a try.

To login, run:
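
The login command itself sits in a collapsed part of this diff; based on the `deepeval login` references later in the commit (for example in `dataset.py`), it is presumably:

```bash
deepeval login
```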

@@ -67,9 +80,9 @@ Open `test_chatbot.py` and write your first test case using DeepEval:

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase
from deepeval.evaluator import assert_test

def test_case():
input = "What if these shoes don't fit?"
@@ -98,9 +111,61 @@ deepeval test run test_chatbot.py

<br />

# View results on our platform
## Evaluating a Dataset / Test Cases in Bulk

We offer a [free web platform](https://app.confident-ai.com) for you to log and view all test results from DeepEval test runs. Our platform allows you to quickly draw insights on how your metrics are improving with each test run and to determine the optimal parameters (such as prompt templates, models, retrieval pipeline) for your specific LLM application.
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate them in bulk:

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
hallucination_metric = HallucinationMetric(minimum_score=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```

```bash
# Run this in the CLI; you can also add an optional -n flag to run tests in parallel
deepeval test run test_<filename>.py -n 4
```

<br/>

Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using pytest:

```python
from deepeval import evaluate
...

evaluate(dataset, [hallucination_metric])
# or
dataset.evaluate([hallucination_metric])
```

# View results on Confident AI

We offer a [free web platform](https://app.confident-ai.com) for you to:

1. Log and view all test results / metrics data from DeepEval's test runs.
2. Debug evaluation results via LLM traces.
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation in production (see the sketch below).

Everything on Confident AI, including how to use the platform, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction).
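
For point 5 above, a rough sketch of tracking a production event follows. This commit adds a `track` export and a `/v1/event` endpoint, but the keyword arguments shown here are assumptions rather than a documented signature; consult the docs linked above for the real one.

```python
import deepeval

# Hypothetical event-tracking call; the keyword arguments are assumptions
# and may not match deepeval.track's actual signature.
deepeval.track(
    event_name="customer-support-chatbot",
    model="gpt-4",
    input="What if these shoes don't fit?",
    response="We offer a 30-day full refund at no extra cost.",
)
```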

To begin, login from the CLI:

@@ -118,7 +183,7 @@ deepeval test run test_chatbot.py

You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!

![ok](https://d2lsxfc3p6r9rv.cloudfront.net/test-summary.png)
![ok](https://d2lsxfc3p6r9rv.cloudfront.net/confident-test-cases.png)

<br />

@@ -133,9 +198,9 @@ Please read [CONTRIBUTING.md](https://github.com/confident-ai/deepeval/blob/main
Features:

- [x] Implement G-Eval
- [ ] Referenceless Evaluation
- [ ] Production Evaluation & Logging
- [ ] Evaluation Dataset Creation
- [x] Referenceless Evaluation
- [x] Production Evaluation & Logging
- [x] Evaluation Dataset Creation

Integrations:

12 changes: 10 additions & 2 deletions deepeval/__init__.py
@@ -6,8 +6,16 @@
from ._version import __version__

from .decorators.hyperparameters import set_hyperparameters

__all__ = ["set_hyperparameters"]
from deepeval.event import track
from deepeval.evaluate import evaluate, run_test, assert_test

__all__ = [
"set_hyperparameters",
"track",
"evaluate",
"run_test",
"assert_test",
]


def compare_versions(version1, version2):
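
With these re-exports in place, downstream code can import the main entry points from the package root. A small sketch with placeholder values (the test case and threshold are illustrative only):

```python
from deepeval import assert_test, evaluate, run_test, track  # new top-level exports
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Placeholder test case; the values are illustrative only.
test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
metric = HallucinationMetric(minimum_score=0.5)

# Bulk evaluation outside of pytest, now available from the package root.
evaluate([test_case], [metric])
```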
2 changes: 1 addition & 1 deletion deepeval/_version.py
@@ -1 +1 @@
__version__: str = "0.20.24"
__version__: str = "0.20.29"
41 changes: 15 additions & 26 deletions deepeval/api.py
@@ -18,8 +18,9 @@


class Endpoints(Enum):
CREATE_DATASET_ENDPOINT = "/v1/dataset"
CREATE_TEST_RUN_ENDPOINT = "/v1/test-run"
DATASET_ENDPOINT = "/v1/dataset"
TEST_RUN_ENDPOINT = "/v1/test-run"
EVENT_ENDPOINT = "/v1/event"


class Api:
@@ -132,7 +133,6 @@ def _api_request(
data=None,
):
"""Generic HTTP request method with error handling."""

url = f"{self.base_api_url}/{endpoint}"
res = self._http_request(
method,
@@ -154,30 +154,19 @@ def _api_request(
except ValueError:
# Some endpoints only return 'OK' message without JSON
return json
elif (
res.status_code == 409
and "task" in endpoint
and body.get("unique_id")
):
retry_history = res.raw.retries.history
# Example RequestHistory tuple
# RequestHistory(method='POST',
# url='/v1/task/imageannotation',
# error=None,
# status=409,
# redirect_location=None)
if retry_history != ():
# See if the first retry was a 500 or 503 error
if retry_history[0][3] >= 500:
uuid = body["unique_id"]
newUrl = f"{self.base_api_url}/tasks?unique_id={uuid}"
# grab task from api
newRes = self._http_request(
"GET", newUrl, headers=headers, auth=auth
)
json = newRes.json()["docs"][0]
elif res.status_code == 409:
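# A 409 conflict (e.g. the dataset alias already exists on Confident AI)
# prompts the user to confirm; on "y" the request is retried with
# body["overwrite"] = True, otherwise it aborts and returns None.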
message = res.json().get("message", "Conflict occurred.")

# Prompt user for input
user_input = input(f"{message} [y/N]: ").strip().lower()
if user_input == "y":
body["overwrite"] = True
return self._api_request(
method, endpoint, headers, auth, params, body, files, data
)
else:
self._raise_on_response(res)
print("Aborted.")
return None
else:
self._raise_on_response(res)
return json
1 change: 1 addition & 0 deletions deepeval/check/__init__.py
@@ -0,0 +1 @@
from .check import check
6 changes: 6 additions & 0 deletions deepeval/check/benchmarks.py
@@ -0,0 +1,6 @@
from enum import Enum


class BenchmarkType(Enum):
    HELM = "Stanford HELM"
    LM_HARNESS = "LM Harness"
21 changes: 21 additions & 0 deletions deepeval/check/check.py
@@ -0,0 +1,21 @@
from typing import Union

from .benchmarks import BenchmarkType


def check(benchmark: Union[str, BenchmarkType]):
    if benchmark == BenchmarkType.HELM:
        handleHELMCheck()
    elif benchmark == BenchmarkType.LM_HARNESS:
        handleLMHarnessCheck()
    else:
        # catch-all for custom benchmark checks
        pass


def handleHELMCheck():
    pass


def handleLMHarnessCheck():
    pass
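
A hedged usage sketch for the new `check` entry point; the benchmark string in the second call is a made-up placeholder for the custom-benchmark catch-all:

```python
from deepeval.check import check
from deepeval.check.benchmarks import BenchmarkType

# Dispatch to the (currently stubbed) Stanford HELM handler.
check(BenchmarkType.HELM)

# Anything else falls through to the custom-benchmark catch-all branch.
check("my-custom-benchmark")
```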
7 changes: 6 additions & 1 deletion deepeval/dataset/api.py
@@ -1,5 +1,5 @@
from pydantic import BaseModel, Field
from typing import Optional, List
from typing import Optional, List, Union


class Golden(BaseModel):
@@ -11,8 +11,13 @@ class Golden(BaseModel):

class APIDataset(BaseModel):
alias: str
overwrite: bool
goldens: Optional[List[Golden]] = Field(default=None)


class CreateDatasetHttpResponse(BaseModel):
link: str


class DatasetHttpResponse(BaseModel):
goldens: List[Golden]
65 changes: 45 additions & 20 deletions deepeval/dataset/dataset.py
@@ -8,18 +8,27 @@

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
from deepeval.evaluator import evaluate
from deepeval.api import Api, Endpoints
from deepeval.dataset.utils import convert_test_cases_to_goldens
from deepeval.dataset.api import APIDataset, CreateDatasetHttpResponse
from deepeval.dataset.utils import (
convert_test_cases_to_goldens,
convert_goldens_to_test_cases,
)
from deepeval.dataset.api import (
APIDataset,
CreateDatasetHttpResponse,
Golden,
DatasetHttpResponse,
)


@dataclass
class EvaluationDataset:
test_cases: List[LLMTestCase]
goldens: List[Golden]

def __init__(self, test_cases: List[LLMTestCase] = []):
self.test_cases = test_cases
self.goldens = []

def add_test_case(self, test_case: LLMTestCase):
self.test_cases.append(test_case)
@@ -28,7 +37,7 @@ def __iter__(self):
return iter(self.test_cases)

def evaluate(self, metrics: List[BaseMetric]):
from deepeval.evaluator import evaluate
from deepeval import evaluate

return evaluate(self.test_cases, metrics)

@@ -234,29 +243,45 @@ def push(self, alias: str):
)
if os.path.exists(".deepeval"):
goldens = convert_test_cases_to_goldens(self.test_cases)
body = APIDataset(alias=alias, goldens=goldens).model_dump(
by_alias=True, exclude_none=True
)
body = APIDataset(
alias=alias, overwrite=False, goldens=goldens
).model_dump(by_alias=True, exclude_none=True)
api = Api()
result = api.post_request(
endpoint=Endpoints.CREATE_DATASET_ENDPOINT.value,
endpoint=Endpoints.DATASET_ENDPOINT.value,
body=body,
)
response = CreateDatasetHttpResponse(
link=result["link"],
)
link = response.link
console = Console()
console.print(
"✅ Dataset pushed to Confidnet AI! View on "
f"[link={link}]{link}[/link]"
)
# webbrowser.open(link)
if result:
response = CreateDatasetHttpResponse(
link=result["link"],
)
link = response.link
console = Console()
console.print(
"✅ Dataset successfully pushed to Confidnet AI! View at "
f"[link={link}]{link}[/link]"
)
webbrowser.open(link)
else:
raise Exception(
"To push dataset to Confident AI, run `deepeval login`"
)

# TODO
def pull(self, alias: str):
pass
if os.path.exists(".deepeval"):
api = Api()
result = api.get_request(
endpoint=Endpoints.DATASET_ENDPOINT.value,
params={"alias": alias},
)
response = DatasetHttpResponse(
goldens=result["goldens"],
)
self.goldens.extend(response.goldens)

# TODO: make this conversion at evaluation time instead
self.test_cases.extend(convert_goldens_to_test_cases(self.goldens))
else:
raise Exception(
"Run `deepeval login` to pull dataset from Confident AI"
)
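
To illustrate the new push / pull flow end to end, here is a hedged sketch. The alias `"my-evals"` and the test-case values are placeholders, and it assumes `deepeval login` has already been run:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Push local test cases to Confident AI under an alias (placeholder values).
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(input="...", actual_output="...", context=["..."]),
    ]
)
dataset.push(alias="my-evals")

# Later, pull the stored goldens back down; they are converted into test cases.
remote_dataset = EvaluationDataset()
remote_dataset.pull(alias="my-evals")
print(len(remote_dataset.test_cases))
```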