Merge from main. #25

Merged
merged 18 commits on Jan 5, 2024

32 changes: 28 additions & 4 deletions README.md
@@ -18,15 +18,15 @@
</a>
</p>

**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models **locally on your machine**.
**DeepEval** is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models that run **locally on your machine**.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

<br />

# Features

- Large variety of ready-to-use evaluation metrics powered by LLMs (all with explanations), statistical methods, or NLP models that runs **locally on your machine**:
- Large variety of ready-to-use LLM evaluation metrics powered by LLMs (all with explanations), statistical methods, or NLP models that run **locally on your machine**:
- Hallucination
- Summarization
- Answer Relevancy
@@ -38,8 +38,8 @@ Whether your application is implemented via RAG or fine-tuning, LangChain or Lla
- Toxicity
- Bias
- etc.
- Evaluate your entire dataset in bulk in under 20 lines of Python code **in parallel**. Do this via the CLI in a Pytest-like manner, or through our `evaluate()` function.
- Easily create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class.
- Evaluate your entire dataset in bulk in under 20 lines of Python code **in parallel**.
- [Automatically integrated with Confident AI](https://app.confident-ai.com) for continuous evaluation throughout the lifetime of your LLM (app):
- log evaluation results and analyze metrics pass / fails
- compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
@@ -115,6 +115,29 @@ deepeval test run test_chatbot.py

<br />

## Evaluating Without Pytest Integration

Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.

```python
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input = "What if these shoes don't fit?"
context = ["All customers are eligible for a 30 day full refund at no extra costs."]
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."

hallucination_metric = HallucinationMetric(minimum_score=0.7)
test_case = LLMTestCase(
input=input,
actual_output=actual_output,
context=context
)
evaluate([test_case], [hallucination_metric])
```
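
Since `evaluate()` prints each test case's results directly rather than going through the Pytest runner, this snippet can be run as-is in a notebook cell.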

## Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate things in bulk:
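
The full example sits outside this hunk; as a rough sketch only (it assumes `EvaluationDataset` can be imported from `deepeval.dataset`, accepts a `test_cases` list, and can be iterated by `pytest.mark.parametrize`), the Pytest-based bulk flow looks something like this:

```python
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset  # assumed import path
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# A dataset is just a collection of test cases.
dataset = EvaluationDataset(  # assumed constructor
    test_cases=[
        LLMTestCase(
            input="What if these shoes don't fit?",
            # Replace this with the actual output from your LLM application
            actual_output="We offer a 30-day full refund at no extra costs.",
            context=["All customers are eligible for a 30 day full refund at no extra costs."],
        )
    ]
)


@pytest.mark.parametrize("test_case", dataset)  # assumes the dataset is directly iterable
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(minimum_score=0.7)
    assert_test(test_case, [hallucination_metric])
```

You would then run it with `deepeval test run test_<filename>.py`, optionally adding `-n 4` to parallelize, as shown in the hunk below.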
@@ -148,7 +171,7 @@ deepeval test run test_<filename>.py -n 4

<br/>

Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using pytest:
Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using our Pytest integration:

```python
from deepeval import evaluate
@@ -168,6 +191,7 @@ We offer a [free web platform](https://app.confident-ai.com) for you to:
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation in production.
6. Track events in production and view live evaluation results over time.

Everything on Confident AI, including how to use Confident, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction).

2 changes: 1 addition & 1 deletion deepeval/__init__.py
@@ -2,13 +2,13 @@
import re

# Optionally add telemetry
from .telemetry import *
from ._version import __version__

from .decorators.hyperparameters import set_hyperparameters
from deepeval.event import track
from deepeval.evaluate import evaluate, run_test, assert_test
from deepeval.test_run import on_test_run_end
from deepeval.telemetry import *

__all__ = [
"set_hyperparameters",
2 changes: 1 addition & 1 deletion deepeval/_version.py
@@ -1 +1 @@
__version__: str = "0.20.43"
__version__: str = "0.20.44"
2 changes: 2 additions & 0 deletions deepeval/cli/test.py
@@ -6,6 +6,7 @@
from deepeval.test_run import test_run_manager, TEMP_FILE_NAME
from deepeval.utils import delete_file_if_exists
from deepeval.test_run import invoke_test_run_end_hook
from deepeval.telemetry import capture_evaluation_count

app = typer.Typer(name="test")

@@ -74,6 +75,7 @@ def run(
pytest_args.extend(["-p", "plugins"])

retcode = pytest.main(pytest_args)
capture_evaluation_count()
test_run_manager.wrap_up_test_run()
invoke_test_run_end_hook()

3 changes: 3 additions & 0 deletions deepeval/evaluate.py
@@ -6,6 +6,7 @@
from dataclasses import dataclass
import copy

from deepeval.telemetry import capture_evaluation_count
from deepeval.progress_context import progress_context
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
@@ -90,6 +91,7 @@ def run_test(
test_run_manager.reset()
with progress_context("Executing run_test()..."):
test_result = execute_test([test_case], metrics, False)[0]
capture_evaluation_count()
print_test_result(test_result)
print("")
print("-" * 70)
Expand Down Expand Up @@ -120,6 +122,7 @@ def evaluate(test_cases: List[LLMTestCase], metrics: List[BaseMetric]):
test_run_manager.reset()
with progress_context("Evaluating testcases..."):
test_results = execute_test(test_cases, metrics, True)
capture_evaluation_count()
for test_result in test_results:
print_test_result(test_result)
print("")
31 changes: 15 additions & 16 deletions deepeval/metrics/answer_relevancy.py
@@ -7,6 +7,7 @@
from deepeval.metrics import BaseMetric
from deepeval.models import GPTModel
from deepeval.templates import AnswerRelevancyTemplate
from deepeval.progress_context import metrics_progress_context


class AnswerRelvancyVerdict(BaseModel):
@@ -35,24 +36,22 @@ def measure(self, test_case: LLMTestCase) -> float:
raise ValueError(
"Input, actual output, or retrieval context cannot be None"
)
print(
"✨ 🍰 ✨ You're using DeepEval's latest Answer Relevancy Metric! This may take a minute..."
)
self.key_points: List[str] = self._generate_key_points(
test_case.actual_output, "\n".join(test_case.retrieval_context)
)
self.verdicts: List[AnswerRelvancyVerdict] = self._generate_verdicts(
test_case.input
)
with metrics_progress_context(self.__name__):
self.key_points: List[str] = self._generate_key_points(
test_case.actual_output, "\n".join(test_case.retrieval_context)
)
self.verdicts: List[
AnswerRelvancyVerdict
] = self._generate_verdicts(test_case.input)

answer_relevancy_score = self._generate_score()
answer_relevancy_score = self._generate_score()

self.reason = self._generate_reason(
test_case.input, test_case.actual_output, answer_relevancy_score
)
self.success = answer_relevancy_score >= self.minimum_score
self.score = answer_relevancy_score
return self.score
self.reason = self._generate_reason(
test_case.input, test_case.actual_output, answer_relevancy_score
)
self.success = answer_relevancy_score >= self.minimum_score
self.score = answer_relevancy_score
return self.score

def _generate_score(self):
relevant_count = 0
34 changes: 17 additions & 17 deletions deepeval/metrics/contextual_precision.py
@@ -7,6 +7,7 @@
from deepeval.metrics import BaseMetric
from deepeval.models import GPTModel
from deepeval.templates import ContextualPrecisionTemplate
from deepeval.progress_context import metrics_progress_context


class ContextualPrecisionVerdict(BaseModel):
@@ -36,25 +37,24 @@ def measure(self, test_case: LLMTestCase) -> float:
raise ValueError(
"Input, actual output, expected output, or retrieval context cannot be None"
)
print(
"✨ 🍰 ✨ You're using DeepEval's latest Contextual Precision Metric! This may take a minute..."
)
self.verdicts: List[
ContextualPrecisionVerdict
] = self._generate_verdicts(
test_case.input,
test_case.expected_output,
test_case.retrieval_context,
)
contextual_precision_score = self._generate_score()

self.reason = self._generate_reason(
test_case.input, contextual_precision_score
)
with metrics_progress_context(self.__name__):
self.verdicts: List[
ContextualPrecisionVerdict
] = self._generate_verdicts(
test_case.input,
test_case.expected_output,
test_case.retrieval_context,
)
contextual_precision_score = self._generate_score()

self.reason = self._generate_reason(
test_case.input, contextual_precision_score
)

self.success = contextual_precision_score >= self.minimum_score
self.score = contextual_precision_score
return self.score
self.success = contextual_precision_score >= self.minimum_score
self.score = contextual_precision_score
return self.score

def _generate_reason(self, input: str, score: float):
if self.include_reason is False:
27 changes: 14 additions & 13 deletions deepeval/metrics/contextual_recall.py
@@ -7,6 +7,7 @@
from deepeval.metrics import BaseMetric
from deepeval.models import GPTModel
from deepeval.templates import ContextualRecallTemplate
from deepeval.progress_context import metrics_progress_context


class ContextualRecallVerdict(BaseModel):
@@ -36,22 +37,22 @@ def measure(self, test_case: LLMTestCase) -> float:
raise ValueError(
"Input, actual output, expected output, or retrieval context cannot be None"
)
print(
"✨ 🍰 ✨ You're using DeepEval's latest Contextual Recall Metric! This may take a minute..."
)
self.verdicts: List[ContextualRecallVerdict] = self._generate_verdicts(
test_case.expected_output, test_case.retrieval_context
)
with metrics_progress_context(self.__name__):
self.verdicts: List[
ContextualRecallVerdict
] = self._generate_verdicts(
test_case.expected_output, test_case.retrieval_context
)

contextual_recall_score = self._generate_score()
contextual_recall_score = self._generate_score()

self.reason = self._generate_reason(
test_case.expected_output, contextual_recall_score
)
self.reason = self._generate_reason(
test_case.expected_output, contextual_recall_score
)

self.success = contextual_recall_score >= self.minimum_score
self.score = contextual_recall_score
return self.score
self.success = contextual_recall_score >= self.minimum_score
self.score = contextual_recall_score
return self.score

def _generate_reason(self, expected_output: str, score: float):
if self.include_reason is False:
29 changes: 14 additions & 15 deletions deepeval/metrics/contextual_relevancy.py
@@ -8,6 +8,7 @@
from deepeval.metrics import BaseMetric
from deepeval.models import GPTModel
from deepeval.templates import ContextualRelevancyTemplate
from deepeval.progress_context import metrics_progress_context


class ContextualRelevancyVerdict(BaseModel):
@@ -35,24 +36,22 @@ def measure(self, test_case: LLMTestCase) -> float:
raise ValueError(
"Input, actual output, or retrieval context cannot be None"
)
print(
"✨ 🍰 ✨ You're using DeepEval's latest Contextual Relevancy Metric! This may take a minute..."
)
self.verdicts_list: List[
List[ContextualRelevancyVerdict]
] = self._generate_verdicts_list(
test_case.input, test_case.retrieval_context
)
contextual_recall_score = self._generate_score()
with metrics_progress_context(self.__name__):
self.verdicts_list: List[
List[ContextualRelevancyVerdict]
] = self._generate_verdicts_list(
test_case.input, test_case.retrieval_context
)
contextual_recall_score = self._generate_score()

self.reason = self._generate_reason(
test_case.input, contextual_recall_score
)
self.reason = self._generate_reason(
test_case.input, contextual_recall_score
)

self.success = contextual_recall_score >= self.minimum_score
self.score = contextual_recall_score
self.success = contextual_recall_score >= self.minimum_score
self.score = contextual_recall_score

return self.score
return self.score

def _generate_reason(self, input: str, score: float):
if self.include_reason is False:
31 changes: 15 additions & 16 deletions deepeval/metrics/faithfulness.py
@@ -8,6 +8,7 @@
from deepeval.utils import trimToJson
from deepeval.models import GPTModel
from deepeval.templates import FaithfulnessTemplate
from deepeval.progress_context import metrics_progress_context


class FaithfulnessVerdict(BaseModel):
@@ -37,22 +38,20 @@ def measure(self, test_case: LLMTestCase):
raise ValueError(
"Input, actual output, or retrieval context cannot be None"
)
print(
"✨ 🍰 ✨ You're using DeepEval's latest Faithfulness Metric! This may take a minute..."
)
self.truths_list: List[List[str]] = self._generate_truths_list(
test_case.retrieval_context
)
self.verdicts_list: List[
List[FaithfulnessVerdict]
] = self._generate_verdicts_list(
self.truths_list, test_case.actual_output
)
faithfulness_score = self._generate_score()
self.reason = self._generate_reason(faithfulness_score)
self.success = faithfulness_score >= self.minimum_score
self.score = faithfulness_score
return self.score
with metrics_progress_context(self.__name__):
self.truths_list: List[List[str]] = self._generate_truths_list(
test_case.retrieval_context
)
self.verdicts_list: List[
List[FaithfulnessVerdict]
] = self._generate_verdicts_list(
self.truths_list, test_case.actual_output
)
faithfulness_score = self._generate_score()
self.reason = self._generate_reason(faithfulness_score)
self.success = faithfulness_score >= self.minimum_score
self.score = faithfulness_score
return self.score

def _generate_score(self):
total_verdicts = 0
16 changes: 16 additions & 0 deletions deepeval/progress_context.py
@@ -17,3 +17,19 @@ def progress_context(
) as progress:
progress.add_task(description=description, total=total)
yield


@contextmanager
def metrics_progress_context(
metric_name: str, total: int = 9999, transient: bool = True
):
description = f"✨ 🍰 ✨ You're using DeepEval's latest {metric_name} Metric! This may take a minute..."
console = Console(file=sys.stderr) # Direct output to standard error
with Progress(
SpinnerColumn(),
TextColumn("[progress.description]{task.description}"),
console=console, # Use the custom console
transient=transient,
) as progress:
progress.add_task(description=description, total=total)
yield
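
For context, here is a minimal usage sketch (not part of the diff) of how the refactored metrics above call this context manager: the old `print()` banner becomes a transient spinner on stderr while the metric's LLM calls run.

```python
from deepeval.progress_context import metrics_progress_context

# Inside a metric's measure(), the expensive work now runs under the spinner.
# "Answer Relevancy" stands in for whatever self.__name__ resolves to at runtime.
with metrics_progress_context("Answer Relevancy"):
    score = 0.9  # placeholder for the real verdict generation and scoring
```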