Merge pull request #18 from confident-ai/main
merge from main.
Anindyadeep authored Dec 8, 2023
2 parents 68f277c + a285970 commit d3c1814
Showing 64 changed files with 2,137 additions and 1,196 deletions.
93 changes: 79 additions & 14 deletions README.md
@@ -18,22 +18,35 @@
</a>
</p>

**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as factual consistency, accuracy, answer relevancy, etc., using LLMs and various other NLP models. It's a production-ready alternative to RAGAS .
**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models **locally on your machine**.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

<br />

# Features

- Large variety of ready-to-use evaluation metrics, ranging from LLM evaluated (G-Eval) to metrics computed via statistical methods or NLP models.
- Large variety of ready-to-use evaluation metrics powered by LLMs, statistical methods, or NLP models that run **locally on your machine**:
  - Hallucination
  - Answer Relevancy
  - RAGAS
  - G-Eval
  - Toxicity
  - Bias
  - etc.
- Easily create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch after this list).
- Evaluate your entire dataset in bulk using fewer than 20 lines of Python code.
- [Integrated with Confident AI](https://confident-ai.com) for instant observability into evaluation results and hyperparameter comparisons (such as prompt templates and model version used).
- Evaluate your entire dataset in bulk, **in parallel**, using fewer than 20 lines of Python code.
- [Automatically integrated with Confident AI](https://app.confident-ai.com) for continuous evaluation throughout the lifetime of your LLM (app):
  - log evaluation results and analyze metric passes / fails
  - compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
  - debug evaluation results via LLM traces
  - manage evaluation test cases / datasets in one place
  - track events to identify live LLM responses in production
  - add production events to existing evaluation datasets to strengthen evals over time
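
Here is a minimal sketch of the custom-metric bullet above. The class name `LengthMetric`, the `min_words` parameter, and the `measure` / `is_successful` / `__name__` members are illustrative assumptions about the base-class contract rather than the documented API:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


# Hypothetical custom metric: passes when the answer is "long enough".
# Names and attributes are assumptions for illustration only.
class LengthMetric(BaseMetric):
    def __init__(self, minimum_score: float = 0.5, min_words: int = 5):
        self.minimum_score = minimum_score
        self.min_words = min_words

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 if the output contains at least `min_words` words.
        word_count = len(test_case.actual_output.split())
        self.score = 1.0 if word_count >= self.min_words else 0.0
        self.success = self.score >= self.minimum_score
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"
```

Once defined, such a metric can be passed to `assert_test` or `evaluate` alongside the built-in ones.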

<br />

# Getting Started 🚀
# 🚀 Getting Started 🚀

Let's pretend your LLM application is a customer support chatbot; here's how DeepEval can help test what you've built.

@@ -43,9 +56,9 @@ Let's pretend your LLM application is a customer support chatbot; here's how Dee
pip install -U deepeval
```

## [Optional] Create an account
## Create an account (highly recommended)

Creating an account on our platform will allow you to log test results, enabling easy tracking of changes and performances over iterations. This step is optional, and you can run test cases even without logging in, but we highly recommend giving it a try.
Although optional, creating an account on our platform will allow you to log test results, enabling easy tracking of changes and performance over iterations. You can run test cases without logging in, but we highly recommend giving it a try.

To login, run:
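
The login command itself sits in a collapsed part of this diff; based on the `deepeval login` references later in the commit (for example in `dataset.py`), it is presumably:

```bash
deepeval login
```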

@@ -67,9 +80,9 @@ Open `test_chatbot.py` and write your first test case using DeepEval:

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase
from deepeval.evaluator import assert_test

def test_case():
input = "What if these shoes don't fit?"
@@ -98,9 +111,61 @@ deepeval test run test_chatbot.py

<br />

# View results on our platform
## Evaluating a Dataset / Test Cases in Bulk

We offer a [free web platform](https://app.confident-ai.com) for you to log and view all test results from DeepEval test runs. Our platform allows you to quickly draw insights on how your metrics are improving with each test run and to determine the optimal parameters (such as prompt templates, models, retrieval pipeline) for your specific LLM application.
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate them in bulk:

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
hallucination_metric = HallucinationMetric(minimum_score=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```

```bash
# Run this in the CLI; you can also add an optional -n flag to run tests in parallel
deepeval test run test_<filename>.py -n 4
```

<br/>

Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using pytest:

```python
from deepeval import evaluate
...

evaluate(dataset, [hallucination_metric])
# or
dataset.evaluate([hallucination_metric])
```

# View results on Confident AI

We offer a [free web platform](https://app.confident-ai.com) for you to:

1. Log and view all test results / metrics data from DeepEval's test runs.
2. Debug evaluation results via LLM traces.
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation in production (see the sketch below).

Everything on Confident AI, including how to use the platform, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction).
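
For point 5 above, a rough sketch of tracking a production event follows. This commit adds a `track` export and a `/v1/event` endpoint, but the keyword arguments shown here are assumptions rather than a documented signature; consult the docs linked above for the real one.

```python
import deepeval

# Hypothetical event-tracking call; the keyword arguments are assumptions
# and may not match deepeval.track's actual signature.
deepeval.track(
    event_name="customer-support-chatbot",
    model="gpt-4",
    input="What if these shoes don't fit?",
    response="We offer a 30-day full refund at no extra cost.",
)
```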

To begin, login from the CLI:

@@ -118,7 +183,7 @@ deepeval test run test_chatbot.py

You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!

![ok](https://d2lsxfc3p6r9rv.cloudfront.net/test-summary.png)
![ok](https://d2lsxfc3p6r9rv.cloudfront.net/confident-test-cases.png)

<br />

@@ -133,9 +198,9 @@ Please read [CONTRIBUTING.md](https://github.com/confident-ai/deepeval/blob/main
Features:

- [x] Implement G-Eval
- [ ] Referenceless Evaluation
- [ ] Production Evaluation & Logging
- [ ] Evaluation Dataset Creation
- [x] Referenceless Evaluation
- [x] Production Evaluation & Logging
- [x] Evaluation Dataset Creation

Integrations:

12 changes: 10 additions & 2 deletions deepeval/__init__.py
@@ -6,8 +6,16 @@
from ._version import __version__

from .decorators.hyperparameters import set_hyperparameters

__all__ = ["set_hyperparameters"]
from deepeval.event import track
from deepeval.evaluate import evaluate, run_test, assert_test

__all__ = [
"set_hyperparameters",
"track",
"evaluate",
"run_test",
"assert_test",
]


def compare_versions(version1, version2):
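
With these re-exports in place, downstream code can import the main entry points from the package root. A small sketch with placeholder values (the test case and threshold are illustrative only):

```python
from deepeval import assert_test, evaluate, run_test, track  # new top-level exports
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Placeholder test case; the values are illustrative only.
test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
metric = HallucinationMetric(minimum_score=0.5)

# Bulk evaluation outside of pytest, now available from the package root.
evaluate([test_case], [metric])
```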
2 changes: 1 addition & 1 deletion deepeval/_version.py
@@ -1 +1 @@
__version__: str = "0.20.24"
__version__: str = "0.20.29"
41 changes: 15 additions & 26 deletions deepeval/api.py
@@ -18,8 +18,9 @@


class Endpoints(Enum):
CREATE_DATASET_ENDPOINT = "/v1/dataset"
CREATE_TEST_RUN_ENDPOINT = "/v1/test-run"
DATASET_ENDPOINT = "/v1/dataset"
TEST_RUN_ENDPOINT = "/v1/test-run"
EVENT_ENDPOINT = "/v1/event"


class Api:
@@ -132,7 +133,6 @@ def _api_request(
data=None,
):
"""Generic HTTP request method with error handling."""

url = f"{self.base_api_url}/{endpoint}"
res = self._http_request(
method,
@@ -154,30 +154,19 @@ def _api_request(
except ValueError:
# Some endpoints only return 'OK' message without JSON
return json
elif (
res.status_code == 409
and "task" in endpoint
and body.get("unique_id")
):
retry_history = res.raw.retries.history
# Example RequestHistory tuple
# RequestHistory(method='POST',
# url='/v1/task/imageannotation',
# error=None,
# status=409,
# redirect_location=None)
if retry_history != ():
# See if the first retry was a 500 or 503 error
if retry_history[0][3] >= 500:
uuid = body["unique_id"]
newUrl = f"{self.base_api_url}/tasks?unique_id={uuid}"
# grab task from api
newRes = self._http_request(
"GET", newUrl, headers=headers, auth=auth
)
json = newRes.json()["docs"][0]
elif res.status_code == 409:
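# A 409 conflict (e.g. the dataset alias already exists on Confident AI)
# prompts the user to confirm; on "y" the request is retried with
# body["overwrite"] = True, otherwise it aborts and returns None.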
message = res.json().get("message", "Conflict occurred.")

# Prompt user for input
user_input = input(f"{message} [y/N]: ").strip().lower()
if user_input == "y":
body["overwrite"] = True
return self._api_request(
method, endpoint, headers, auth, params, body, files, data
)
else:
self._raise_on_response(res)
print("Aborted.")
return None
else:
self._raise_on_response(res)
return json
1 change: 1 addition & 0 deletions deepeval/check/__init__.py
@@ -0,0 +1 @@
from .check import check
6 changes: 6 additions & 0 deletions deepeval/check/benchmarks.py
@@ -0,0 +1,6 @@
from enum import Enum


class BenchmarkType(Enum):
    HELM = "Stanford HELM"
    LM_HARNESS = "LM Harness"
21 changes: 21 additions & 0 deletions deepeval/check/check.py
@@ -0,0 +1,21 @@
from typing import Union

from .benchmarks import BenchmarkType


def check(benchmark: Union[str, BenchmarkType]):
    if benchmark == BenchmarkType.HELM:
        handleHELMCheck()
    elif benchmark == BenchmarkType.LM_HARNESS:
        handleLMHarnessCheck()
    else:
        # catch-all for custom benchmark checks
        pass


def handleHELMCheck():
    pass


def handleLMHarnessCheck():
    pass
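
A hedged usage sketch for the new `check` entry point; the benchmark string in the second call is a made-up placeholder for the custom-benchmark catch-all:

```python
from deepeval.check import check
from deepeval.check.benchmarks import BenchmarkType

# Dispatch to the (currently stubbed) Stanford HELM handler.
check(BenchmarkType.HELM)

# Anything else falls through to the custom-benchmark catch-all branch.
check("my-custom-benchmark")
```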
7 changes: 6 additions & 1 deletion deepeval/dataset/api.py
@@ -1,5 +1,5 @@
from pydantic import BaseModel, Field
from typing import Optional, List
from typing import Optional, List, Union


class Golden(BaseModel):
@@ -11,8 +11,13 @@ class Golden(BaseModel):

class APIDataset(BaseModel):
alias: str
overwrite: bool
goldens: Optional[List[Golden]] = Field(default=None)


class CreateDatasetHttpResponse(BaseModel):
link: str


class DatasetHttpResponse(BaseModel):
goldens: List[Golden]
65 changes: 45 additions & 20 deletions deepeval/dataset/dataset.py
@@ -8,18 +8,27 @@

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
from deepeval.evaluator import evaluate
from deepeval.api import Api, Endpoints
from deepeval.dataset.utils import convert_test_cases_to_goldens
from deepeval.dataset.api import APIDataset, CreateDatasetHttpResponse
from deepeval.dataset.utils import (
convert_test_cases_to_goldens,
convert_goldens_to_test_cases,
)
from deepeval.dataset.api import (
APIDataset,
CreateDatasetHttpResponse,
Golden,
DatasetHttpResponse,
)


@dataclass
class EvaluationDataset:
test_cases: List[LLMTestCase]
goldens: List[Golden]

def __init__(self, test_cases: List[LLMTestCase] = []):
self.test_cases = test_cases
self.goldens = []

def add_test_case(self, test_case: LLMTestCase):
self.test_cases.append(test_case)
@@ -28,7 +37,7 @@ def __iter__(self):
return iter(self.test_cases)

def evaluate(self, metrics: List[BaseMetric]):
from deepeval.evaluator import evaluate
from deepeval import evaluate

return evaluate(self.test_cases, metrics)

@@ -234,29 +243,45 @@ def push(self, alias: str):
)
if os.path.exists(".deepeval"):
goldens = convert_test_cases_to_goldens(self.test_cases)
body = APIDataset(alias=alias, goldens=goldens).model_dump(
by_alias=True, exclude_none=True
)
body = APIDataset(
alias=alias, overwrite=False, goldens=goldens
).model_dump(by_alias=True, exclude_none=True)
api = Api()
result = api.post_request(
endpoint=Endpoints.CREATE_DATASET_ENDPOINT.value,
endpoint=Endpoints.DATASET_ENDPOINT.value,
body=body,
)
response = CreateDatasetHttpResponse(
link=result["link"],
)
link = response.link
console = Console()
console.print(
"✅ Dataset pushed to Confidnet AI! View on "
f"[link={link}]{link}[/link]"
)
# webbrowser.open(link)
if result:
response = CreateDatasetHttpResponse(
link=result["link"],
)
link = response.link
console = Console()
console.print(
"✅ Dataset successfully pushed to Confidnet AI! View at "
f"[link={link}]{link}[/link]"
)
webbrowser.open(link)
else:
raise Exception(
"To push dataset to Confident AI, run `deepeval login`"
)

# TODO
def pull(self, alias: str):
pass
if os.path.exists(".deepeval"):
api = Api()
result = api.get_request(
endpoint=Endpoints.DATASET_ENDPOINT.value,
params={"alias": alias},
)
response = DatasetHttpResponse(
goldens=result["goldens"],
)
self.goldens.extend(response.goldens)

# TODO: make this conversion at evaluation time instead
self.test_cases.extend(convert_goldens_to_test_cases(self.goldens))
else:
raise Exception(
"Run `deepeval login` to pull dataset from Confident AI"
)
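
To illustrate the new push / pull flow end to end, here is a hedged sketch. The alias `"my-evals"` and the test-case values are placeholders, and it assumes `deepeval login` has already been run:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Push local test cases to Confident AI under an alias (placeholder values).
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(input="...", actual_output="...", context=["..."]),
    ]
)
dataset.push(alias="my-evals")

# Later, pull the stored goldens back down; they are converted into test cases.
remote_dataset = EvaluationDataset()
remote_dataset.pull(alias="my-evals")
print(len(remote_dataset.test_cases))
```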