Merge pull request #27 from confident-ai/main
Merge from main.
Anindyadeep authored Jan 14, 2024
2 parents 2109a70 + 6b4e6e9 commit 6e56531
Showing 74 changed files with 1,682 additions and 829 deletions.
Binary file modified .DS_Store
Binary file not shown.
46 changes: 37 additions & 9 deletions README.md
@@ -18,24 +18,28 @@
</a>
</p>

**DeepEval** is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models **locally on your machine**.
**DeepEval** is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models that run **locally on your machine**.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

<br />

# Features

- Large variety of ready-to-use evaluation metrics powered by LLMs, statistical methods, or NLP models that runs **locally on your machine**:
- Large variety of ready-to-use LLM evaluation metrics powered by LLMs (all with explanations), statistical methods, or NLP models that run **locally on your machine**:
- Hallucination
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- G-Eval
- Toxicity
- Bias
- etc.
- Evaluate your entire dataset in bulk in under 20 lines of Python code **in parallel**. Do this via the CLI in a Pytest-like manner, or through our `evaluate()` function.
- Easily create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch below this list).
- Evaluate your entire dataset in bulk using fewer than 20 lines of Python code **in parallel**.
- [Automatically integrated with Confident AI](https://app.confident-ai.com) for continuous evaluation throughout the lifetime of your LLM (app):
- log evaluation results and analyze which metrics pass / fail
- compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
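
Below is a minimal sketch of what such a custom metric could look like. The `BaseMetric` import path and the `measure` / `is_successful` / `threshold` hooks follow DeepEval's documented pattern; the `LatencyMetric` class and its `latency` field are invented here purely for illustration and are not part of this change.

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class LatencyMetric(BaseMetric):
    """Hypothetical metric: passes when the app responded quickly enough."""

    def __init__(self, max_seconds: float = 10.0):
        self.threshold = max_seconds

    def measure(self, test_case: LLMTestCase) -> float:
        # Assumes the caller attached a `latency` attribute to the test case.
        self.score = getattr(test_case, "latency", 0.0)
        self.success = self.score <= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Latency"
```

An instance of such a metric can then be passed to `assert_test` or `evaluate()` alongside the built-in metrics.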
@@ -90,7 +94,7 @@ def test_case():

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."
hallucination_metric = HallucinationMetric(minimum_score=0.7)
hallucination_metric = HallucinationMetric(threshold=0.7)
test_case = LLMTestCase(input=input, actual_output=actual_output, context=context)
assert_test(test_case, [hallucination_metric])
```
@@ -104,13 +108,36 @@ deepeval test run test_chatbot.py
**Your test should have passed ✅** Let's break down what happened.

- The variable `input` mimics user input, and `actual_output` is a placeholder for your chatbot's intended output based on this query.
- The variable `context` contains the relevant information from your knowledge base, and `HallucinationMetric(minimum_score=0.7)` is an out-of-the-box metric provided by DeepEval. It helps you evaluate the factual accuracy of your chatbot's output based on the provided context.
- The metric score ranges from 0 - 1. The `minimum_score=0.7` threshold ultimately determines whether your test has passed or not.
- The variable `context` contains the relevant information from your knowledge base, and `HallucinationMetric(threshold=0.7)` is an out-of-the-box metric provided by DeepEval. It helps you evaluate the factual accuracy of your chatbot's output based on the provided context.
- The metric score ranges from 0 to 1. The `threshold=0.7` value ultimately determines whether your test has passed or not.
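
If you want to inspect the raw score rather than assert on it, metrics can typically also be run standalone. This is a hedged sketch, assuming the metric exposes `measure()`, `score`, and `is_successful()` as in DeepEval's documentation:

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    context=["All customers are eligible for a 30 day full refund at no extra costs."],
)

metric = HallucinationMetric(threshold=0.7)
metric.measure(test_case)      # populates metric.score with a value between 0 and 1
print(metric.score)            # the raw metric score
print(metric.is_successful())  # pass / fail relative to the 0.7 threshold
```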

[Read our documentation](https://docs.confident-ai.com/docs/getting-started) for more information on how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.

<br />

## Evaluating Without Pytest Integration

Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.

```python
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input = "What if these shoes don't fit?"
context = ["All customers are eligible for a 30 day full refund at no extra costs."]
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."

hallucination_metric = HallucinationMetric(threshold=0.7)
test_case = LLMTestCase(
input=input,
actual_output=actual_output,
context=context
)
evaluate([test_case], [hallucination_metric])
```

## Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate things in bulk:
@@ -132,8 +159,8 @@ dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(minimum_score=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```
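
The hunk above only shows part of the example. A self-contained sketch of the same pattern might look like the following; the two test cases and their contents are placeholders invented for illustration, and the import paths follow DeepEval's documented usage for this release:

```python
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# Placeholder test cases -- replace with real inputs and your LLM's actual outputs.
first_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    context=["All customers are eligible for a 30 day full refund at no extra costs."],
)
second_test_case = LLMTestCase(
    input="Do you ship internationally?",
    actual_output="Yes, we ship to over 50 countries worldwide.",
    context=["We ship to more than 50 countries worldwide."],
)

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])


@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```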

@@ -144,7 +171,7 @@ deepeval test run test_<filename>.py -n 4

<br/>

Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using pytest:
Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using our Pytest integration:

```python
from deepeval import evaluate
@@ -164,6 +191,7 @@ We offer a [free web platform](https://app.confident-ai.com) for you to:
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation in production.
6. Track events in production and view live evaluation results over time.

Everything on Confident AI, including how to use Confident, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction).

2 changes: 1 addition & 1 deletion deepeval/__init__.py
@@ -2,13 +2,13 @@
import re

# Optionally add telemetry
from .telemetry import *
from ._version import __version__

from .decorators.hyperparameters import set_hyperparameters
from deepeval.event import track
from deepeval.evaluate import evaluate, run_test, assert_test
from deepeval.test_run import on_test_run_end
from deepeval.telemetry import *

__all__ = [
"set_hyperparameters",
2 changes: 1 addition & 1 deletion deepeval/_version.py
@@ -1 +1 @@
__version__: str = "0.20.42"
__version__: str = "0.20.46"
80 changes: 79 additions & 1 deletion deepeval/api.py
@@ -4,9 +4,13 @@
import requests
import warnings
from requests.adapters import HTTPAdapter, Response, Retry
import aiohttp
from aiohttp import ClientSession
from requests.adapters import HTTPAdapter
from enum import Enum

from deepeval.constants import API_KEY_ENV
from deepeval.key_handler import KEY_FILE_HANDLER, KeyValues
from enum import Enum

API_BASE_URL = "https://app.confident-ai.com/api"

@@ -259,3 +263,77 @@ def quote_string(text: str) -> str:
str: Quoted text in return
"""
return urllib.parse.quote(text, safe="")

async def _api_request_async(
    self,
    method,
    endpoint,
    headers=None,
    auth=None,
    params=None,
    body=None,
    files=None,
    data=None,
):
    """Generic asynchronous HTTP request method with error handling."""
    url = f"{self.base_api_url}/{endpoint}"
    async with ClientSession() as session:
        try:
            # Prepare a multipart body for file uploads if files are present
            if files:
                data = aiohttp.FormData()
                for file_name, file_content in files.items():
                    data.add_field(
                        file_name, file_content, filename=file_name
                    )

            # Send the request; aiohttp does not allow `json` and `data`
            # together, so the multipart form data takes precedence
            res = await session.request(
                method=method,
                url=url,
                headers=headers,
                params=params,
                json=body if not files else None,
                data=data if files else None,
            )

            # Check response status
            if res.status == 200:
                try:
                    return await res.json()
                except ValueError:
                    return await res.text()
            else:
                # Determine how to process the response based on Content-Type
                content_type = res.headers.get("Content-Type", "")
                if "application/json" in content_type:
                    error_response = await res.json()
                else:
                    error_response = await res.text()

                # Specifically handle status code 400
                if res.status == 400:
                    print(f"Error 400: Bad Request - {error_response}")

                print(f"Error {res.status}: {error_response}")
                return None

        except Exception as err:
            raise Exception(f"HTTP request failed: {err}") from err

async def post_request_async(
    self, endpoint, body=None, files=None, data=None
):
    """Generic asynchronous POST request wrapper."""
    return await self._api_request_async(
        "POST",
        endpoint,
        headers=self._headers
        if files is None
        else self._headers_multipart_form_data,
        auth=self._auth,
        body=body,
        files=files,
        data=data,
    )
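
For context, a hypothetical call site for the new async wrapper might look like the sketch below. The `Api` class name, its no-argument constructor, and the endpoint string are assumptions for illustration; only `post_request_async` itself appears in this diff.

```python
import asyncio

from deepeval.api import Api


async def main():
    api = Api()  # assumes an API key is already configured via env var or key file
    result = await api.post_request_async(
        endpoint="v1/test-run",        # hypothetical endpoint
        body={"name": "example run"},  # forwarded as the JSON payload
    )
    print(result)


asyncio.run(main())
```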
60 changes: 44 additions & 16 deletions deepeval/chat_completion/retry.py
@@ -1,19 +1,47 @@
from typing import Callable, Any
import random
import time
import openai


def call_openai_with_retry(
    callable: Callable[[], Any], max_retries: int = 2
) -> Any:
    for _ in range(max_retries):
        try:
            response = callable()
            return response
        except Exception as e:
            print(f"An error occurred: {e}. Retrying...")
            time.sleep(2)
            continue

    raise Exception(
        "Max retries reached. Unable to make a successful API call to OpenAI."
    )


def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.RateLimitError,),
):
    """Retry a function with exponential backoff."""

    def wrapper(*args, **kwargs):
        # Initialize variables
        num_retries = 0
        delay = initial_delay

        # Loop until a successful response, max_retries is hit, or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)

            # Retry on specified errors
            except errors as e:
                # Increment retries
                num_retries += 1

                # Check if max retries has been reached
                if num_retries > max_retries:
                    raise Exception(
                        f"Maximum number of retries ({max_retries}) exceeded."
                    )

                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())

                # Sleep for the delay
                time.sleep(delay)

            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e

    return wrapper
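
A hedged usage sketch for the new helper follows; the OpenAI client setup, model name, and prompt are illustrative assumptions, not part of this change.

```python
import openai

from deepeval.chat_completion.retry import retry_with_exponential_backoff

client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in the environment


@retry_with_exponential_backoff
def create_chat_completion(**kwargs):
    # Rate-limit errors raised here are retried with exponential backoff.
    return client.chat.completions.create(**kwargs)


response = create_chat_completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```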
66 changes: 0 additions & 66 deletions deepeval/cli/azure_openai.py

This file was deleted.

