feat: add LLM as judge evaluations (#960)
* add claude llm judge model
* add correctness and annotation relevancy metrics
* update root README and evals README
* add QA eval runner
* make eval runners more flexible with env vars
jalling97 authored Sep 17, 2024
1 parent 2d9cff2 commit 3e5f1e0
Showing 13 changed files with 689 additions and 45 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -15,6 +15,7 @@
- [UI](#ui)
- [Backends](#backends)
- [Repeater](#repeater)
- [Evaluations](#evaluations)
- [Usage](#usage)
- [Local Development](#local-development)
- [Contributing](#contributing)
@@ -58,6 +59,7 @@ The LeapfrogAI repository follows a monorepo structure based around an [API](#ap
leapfrogai/
├── src/
│ ├── leapfrogai_api/ # source code for the API
│ ├── leapfrogai_evals/ # source code for the LeapfrogAI evaluation framework
│ ├── leapfrogai_sdk/ # source code for the SDK
│ └── leapfrogai_ui/ # source code for the UI
├── packages/
@@ -115,6 +117,10 @@ LeapfrogAI provides several backends for a variety of use cases. Below is the ba

The [repeater](packages/repeater/) "model" is a basic "backend" that parrots all inputs it receives back to the user. It is built in the same way as the actual backends and is primarily used for testing the API.

### Evaluations

LeapfrogAI comes with an evaluation framework that is integrated with [DeepEval](https://docs.confident-ai.com/). For more information on running and utilizing evaluations in LeapfrogAI, please see the [Evals README](/src/leapfrogai_evals/README.md).

### Flavors

Each component has different images and values that refer to a specific image registry and/or hardening source. These images are packaged using [Zarf Flavors](https://docs.zarf.dev/ref/examples/package-flavors/):
28 changes: 28 additions & 0 deletions src/leapfrogai_evals/.env.example
@@ -0,0 +1,28 @@
LEAPFROGAI_API_URL="https://leapfrogai-api.uds.dev/openai/v1"
LEAPFROGAI_API_KEY="lfai-api-key"
ANTHROPIC_API_KEY="anthropic-api-key"

# ---- hyperparameters ----
# general
MODEL_TO_EVALUATE=vllm
TEMPERATURE=0.1
LLM_JUDGE=ClaudeSonnet

# Needle in a Haystack
NIAH_DATASET=defenseunicorns/LFAI_RAG_niah_v1
NIAH_ADD_PADDING=True
NIAH_MESSAGE_PROMPT="What is the secret code?"
NIAH_INSTRUCTION_TEMPLATE=DEFAULT_INSTRUCTION_TEMPLATE # either the name of a predefined instruction-template global or a literal prompt string
NIAH_MIN_DOC_LENGTH=4096
NIAH_MAX_DOC_LENGTH=4096
NIAH_MIN_DEPTH=0.0
NIAH_MAX_DEPTH=1.0
NIAH_NUM_COPIES=2

# Question & Answering
QA_DATASET=defenseunicorns/LFAI_RAG_qa_v1
QA_INSTRUCTION_TEMPLATE=DEFAULT_INSTRUCTION_TEMPLATE # either the name of a predefined instruction-template global or a literal prompt string
QA_NUM_SAMPLES=25
QA_NUM_DOCUMENTS=5
#QA_VECTOR_STORE_ID= # set this to a vector store ID to reuse an existing vector store that already contains the files
QA_CLEANUP_VECTOR_STORE=True # recommended: set this to False if a vector store ID is provided
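
These values are read as ordinary environment variables once `.env` is loaded; the sketch below shows one assumed way a runner could consume them (the variable names come from this file, but the parsing itself is illustrative, not the actual runner code):

```python
# Sketch only: read eval hyperparameters from the environment after loading .env.
# The actual NIAH/QA runners may parse and default these values differently.
import os

from dotenv import load_dotenv

load_dotenv()  # pull the values from .env into os.environ

model = os.environ.get("MODEL_TO_EVALUATE", "vllm")
temperature = float(os.environ.get("TEMPERATURE", "0.1"))
niah_depth_range = (
    float(os.environ.get("NIAH_MIN_DEPTH", "0.0")),
    float(os.environ.get("NIAH_MAX_DEPTH", "1.0")),
)
qa_num_samples = int(os.environ.get("QA_NUM_SAMPLES", "25"))
qa_cleanup = os.environ.get("QA_CLEANUP_VECTOR_STORE", "True").lower() == "true"
```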
50 changes: 48 additions & 2 deletions src/leapfrogai_evals/README.md
@@ -7,13 +7,20 @@ The LeapfrogAI RAG evaluation system assumes the following:

- LeapfrogAI is deployed
- A valid LeapfrogAI API key is set (for more info, see the [API README](/src/leapfrogai_api/README.md))
- For all LLM-enabled metrics, a valid Anthropic API key is set

Set the following environment variables:
For the easiest setup, copy the `.env.example` file:

```bash
cp .env.example .env
```

Within `.env`, fill in the required environment variables:

```bash
LEAPFROGAI_API_URL=<LeapfrogAI API url, usually: https://leapfrogai-api.uds.dev/openai/v1 for development>
LEAPFROGAI_API_KEY=<LeapfrogAI API key>
MODEL_TO_EVALUATE="vllm" # can also be provided as "model" to the __init__ for the runner
ANTHROPIC_API_KEY=<Anthropic API key>
```

Running `main.py` will by default run all of the evaluations currently available:
@@ -24,6 +31,45 @@ python -m pip install .
python main.py
```

## Question/Answer Evaluation

Question and answer pairs are a valuable setup for evaluating LLM systems as a whole. Within LeapfrogAI, this type of evaluation takes an input question, expected context, and expected output, and compares them to the retrieved context from RAG and the system's final output.

### Data
The LeapfrogAI QA evaluation uses a custom dataset available on HuggingFace: [defenseunicorns/LFAI_RAG_qa_v1](https://huggingface.co/datasets/defenseunicorns/LFAI_RAG_qa_v1)

LFAI_RAG_qa_v1 contains 36 question/answer/context entries that are intended to be used for LLM-as-a-judge enabled RAG Evaluations.

Example:

```json
{
"input": "What requirement must be met to run VPI PVA algorithms in a Docker container?",
"actual_output": null,
"expected_output": "To run VPI PVA algorithms in a Docker container, the same VPI version must be installed on the Docker host.",
"context": [
"2.6.\nCompute\nStack\nThe\nfollowing\nDeep\nLearning-related\nissues\nare\nnoted\nin\nthis\nrelease.\nIssue\nDescription\n4564075\nTo\nrun\nVPI\nPVA\nalgorithms\nin\na\ndocker\ncontainer,\nthe\nsame\nVPI\nversion\nhas\nto\nbe\ninstalled\non \nthe\ndocker\nhost.\n2.7.\nDeepstream\nIssue\nDescription\n4325898\nThe\npipeline\ngets\nstuck\nfor\nmulti\u0000lesrc\nwhen\nusing\nnvv4l2decoder.\nDS\ndevelopers\nuse \nthe\npipeline\nto\nrun\ndecode\nand\ninfer\njpeg\nimages.\nNVIDIA\nJetson\nLinux\nRelease\nNotes\nRN_10698-r36.3\n|\n11"
],
"source_file": "documents/Jetson_Linux_Release_Notes_r36.3.pdf"
}
```
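
For reference, the dataset can be pulled directly with the Hugging Face `datasets` library; this short snippet is illustrative only and is not part of the QA runner itself:

```python
from datasets import load_dataset

# Load the QA dataset referenced above and peek at one entry
dataset = load_dataset("defenseunicorns/LFAI_RAG_qa_v1")
split = next(iter(dataset))  # use whichever split the dataset ships with
row = dataset[split][0]
print(row["input"], "->", row["expected_output"])
```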

### Experimental Design
The LeapfrogAI QA evaluation uses the following process:

- build a vector store and upload the contextual documents from the QA dataset
- for each row in the dataset:
  - create an assistant
  - prompt the LLM to answer the input question using the contextual documents
  - record the following:
    - the model response
    - the retrieved context from RAG
  - delete the assistant
- delete the contextual documents
- delete the vector store

Various metrics can then be calculated using these individual pieces.
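
The sketch below shows roughly how this flow maps onto the OpenAI-compatible LeapfrogAI API using the `openai` client. It is a minimal illustration, not the actual `QA_Runner` implementation: the document list is a placeholder, and the real runner may use different calls, options, and error handling.

```python
import os

import openai

client = openai.OpenAI(
    base_url=os.environ["LEAPFROGAI_API_URL"],
    api_key=os.environ["LEAPFROGAI_API_KEY"],
)

# 1. Build a vector store and upload the contextual documents (placeholder paths)
vector_store = client.beta.vector_stores.create(name="qa-eval-docs")
doc_paths = ["documents/Jetson_Linux_Release_Notes_r36.3.pdf"]  # hypothetical list
file_ids = []
for path in doc_paths:
    with open(path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="assistants")
    file_ids.append(uploaded.id)
    client.beta.vector_stores.files.create(
        vector_store_id=vector_store.id, file_id=uploaded.id
    )

# 2. Create an assistant backed by the vector store and ask one question
assistant = client.beta.assistants.create(
    name="qa-eval-assistant",
    instructions="Answer the question using the provided documents.",
    model=os.environ.get("MODEL_TO_EVALUATE", "vllm"),
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What requirement must be met to run VPI PVA algorithms in a Docker container?",
)
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)

# 3. Record the model response and the RAG annotations (retrieved context)
message = client.beta.threads.messages.list(thread_id=thread.id).data[0]
actual_output = message.content[0].text.value
actual_annotations = message.content[0].text.annotations

# 4. Clean up the assistant, the uploaded documents, and the vector store
client.beta.assistants.delete(assistant.id)
for file_id in file_ids:
    client.files.delete(file_id)
client.beta.vector_stores.delete(vector_store.id)
```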

## Needle in a Haystack (NIAH)

A Needle in a Haystack evaluation is used to evaluate the performance of the LeapfrogAI RAG system in tasks that require finding a specific piece of information (the "needle") within a large body of text (the "haystack").
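
As a toy illustration of the idea (not the actual `NIAH_Runner` logic), a needle can be planted at a fractional depth inside padded filler text, mirroring the `NIAH_MIN_DEPTH`/`NIAH_MAX_DEPTH` and document-length settings in `.env.example`; lengths are treated as characters here purely for simplicity:

```python
def build_haystack(needle: str, filler: str, doc_length: int, depth: float) -> str:
    """Pad filler text to doc_length characters and insert the needle at a 0.0-1.0 depth."""
    haystack = (filler * (doc_length // len(filler) + 1))[:doc_length]
    insert_at = int(depth * len(haystack))
    return haystack[:insert_at] + " " + needle + " " + haystack[insert_at:]


# Hypothetical needle and filler values, for illustration only
doc = build_haystack(
    needle="The secret code is 713.",
    filler="The quick brown fox jumps over the lazy dog. ",
    doc_length=4096,
    depth=0.5,  # place the needle halfway through the document
)
```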
Empty file.
55 changes: 55 additions & 0 deletions src/leapfrogai_evals/judges/claude_sonnet.py
@@ -0,0 +1,55 @@
import asyncio
import functools
import os
from typing import Optional

import instructor
from anthropic import Anthropic
from deepeval.models.base_model import DeepEvalBaseLLM
from pydantic import BaseModel


class ClaudeSonnet(DeepEvalBaseLLM):
    """A DeepEval LLM class that uses the Anthropic API to serve Claude models as evaluation judges"""

    def __init__(
        self, api_key: Optional[str] = None, model: str = "claude-3-5-sonnet-20240620"
    ):
        self.model = model
        self.client = Anthropic(api_key=api_key or os.environ.get("ANTHROPIC_API_KEY"))

    def load_model(self):
        """Returns the currently selected model"""
        return self.model

    def generate(
        self,
        prompt: str,
        schema: BaseModel,
        max_tokens: int = 1024,
    ) -> BaseModel:
        """Generates a schema-validated response from the Anthropic API"""
        instructor_client = instructor.from_anthropic(self.client)
        response = instructor_client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return response

    async def a_generate(
        self, prompt: str, schema: BaseModel, *args, **kwargs
    ) -> BaseModel:
        """Async implementation of the generate function"""
        loop = asyncio.get_running_loop()
        # functools.partial is used because run_in_executor does not forward keyword arguments
        return await loop.run_in_executor(
            None, functools.partial(self.generate, prompt, schema, *args, **kwargs)
        )

    def get_model_name(self):
        return f"Anthropic {self.model}"
138 changes: 120 additions & 18 deletions src/leapfrogai_evals/main.py
@@ -1,40 +1,73 @@
import deepeval
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

import logging
import numpy as np
import os
from dotenv import load_dotenv
import time
from typing import Optional, List

from leapfrogai_evals.runners.niah_runner import NIAH_Runner
from leapfrogai_evals.judges.claude_sonnet import ClaudeSonnet # noqa
from leapfrogai_evals.metrics.annotation_relevancy import AnnotationRelevancyMetric
from leapfrogai_evals.metrics.correctness import CorrectnessMetric
from leapfrogai_evals.metrics.niah_metrics import NIAH_Retrieval, NIAH_Response
from leapfrogai_evals.runners.niah_runner import NIAH_Runner
from leapfrogai_evals.runners.qa_runner import QA_Runner

ALL_EVALS = ["LFAI_NIAH"]
ALL_EVALS = ["niah_eval", "qa_eval"]


class RAGEvaluator:
"""A class that handles running all of the LeapfrogAI RAG evaluations"""

def __init__(self):
self.eval_list = None
def __init__(
self,
eval_list: Optional[List[str]] = None,
):
self.eval_list = eval_list
self.test_case_dict = None
self.niah_test_cases = None
self.eval_options = ALL_EVALS
self.eval_results = dict()

def set_evaluations(self, evals_list=[]) -> None:
def set_evaluations(self, eval_list: List[str] = None) -> None:
"""Set the evaluations that will be run via a list"""
if len(evals_list) == 0:
if not eval_list:
logging.info("Setting eval list to ALL")
self.eval_list = ALL_EVALS
# TODO: Add other evals options
else:
for item in eval_list:
if item not in ALL_EVALS:
raise AttributeError(
f"'{item}' is not an available evaluation. Please limit the list to one of the following: {ALL_EVALS}"
)
self.eval_list = eval_list

def run_evals(self, *args, **kwargs) -> None:
"""Run all of the selected evaluations"""
if self.eval_list is None:
raise AttributeError(
"the list of evaluations has not been set. Please do so by running the 'set_evaluations()' function"
)

logging.info("Running the following evaluations:")
for eval in self.eval_list:
logging.info(f" -{eval}")
if "LFAI_NIAH" in self.eval_list:
self._niah_evaluation(*args, **kwargs)
# TODO: add more evaluations
logging.info("".join([f"\n - {eval_name}" for eval_name in self.eval_list]))

def _niah_evaluation(self, *args, **kwargs) -> None:
start_time = time.time()
for eval_name in self.eval_list:
eval = getattr(self, eval_name)
eval(*args, **kwargs)
end_time = time.time()

self.eval_results["Eval Execution Runtime (seconds)"] = end_time - start_time

logging.info("\n\nFinal Results:")
for key, value in self.eval_results.items():
logging.info(f"{key}: {value}")

def niah_eval(self, *args, **kwargs) -> None:
"""Run the Needle in a Haystack evaluation"""
logging.info("Beginning Needle in a Haystack Evaluation...")
self.niah_test_cases = []

niah_runner = NIAH_Runner(*args, **kwargs)
@@ -55,16 +88,85 @@ def _niah_evaluation(self, *args, **kwargs) -> None:
)

# run metrics
# TODO: Give ability to choose which metrics to run
retrieval_metric = NIAH_Retrieval()
response_metric = NIAH_Response()
metrics = [retrieval_metric, response_metric]

for metric in metrics:
scores = []
successes = []
for test_case in self.niah_test_cases:
metric.measure(test_case)
scores.append(metric.score)
successes.append(metric.is_successful())
self.eval_results[f"Average {metric.__name__}"] = np.mean(scores)
logging.info(f"{metric.__name__} Results:")
logging.info(f"average score: {np.mean(scores)}")
logging.info(f"scores: {scores}")
logging.info(f"successes: {successes}")

def qa_eval(self, *args, **kwargs) -> None:
"""Runs the Question/Answer evaluation"""
logging.info("Beginning Question/Answer Evaluation...")
self.qa_test_cases = []

qa_runner = QA_Runner(*args, **kwargs)
qa_runner.run_experiment()

# build test cases out of the qa_dataset
for row in qa_runner.qa_data:
self.qa_test_cases.append(
LLMTestCase(
input=row["input"],
actual_output=row["actual_output"],
context=row["context"],
expected_output=row["expected_output"],
additional_metadata={
"actual_annotations": row["actual_annotations"],
"expected_annotations": row["expected_annotations"],
},
# retrieval_context = row['retrieval_context'] # TODO: add this for more metrics
)
)

# Create judge llm
try:
judge_model = globals()[os.environ.get("LLM_JUDGE")]()
except KeyError:
judge_model = os.environ.get("LLM_JUDGE")

# run metrics
# TODO: Give ability to choose which metrics to run
correctness_metric = CorrectnessMetric(model=judge_model)
answer_relevancy_metric = AnswerRelevancyMetric(model=judge_model)
annotation_relevancy_metric = AnnotationRelevancyMetric()
metrics = [
correctness_metric,
answer_relevancy_metric,
annotation_relevancy_metric,
]

deepeval.evaluate(
test_cases=self.niah_test_cases, metrics=[retrieval_metric, response_metric]
)
for metric in metrics:
scores = []
successes = []
reasons = []
for test_case in self.qa_test_cases:
metric.measure(test_case)
scores.append(metric.score)
successes.append(metric.is_successful())
reasons.append(metric.reason)
self.eval_results[f"Average {metric.__name__}"] = np.mean(scores)
logging.info(f"{metric.__name__} Results:")
logging.info(f"average score: {np.mean(scores)}")
logging.info(f"scores: {scores}")
logging.info(f"successes: {successes}")
logging.info(f"reasons: {reasons}")


if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
load_dotenv()
evaluator = RAGEvaluator()
evaluator.set_evaluations()
evaluator.run_evals()
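
As a usage note, the evaluator can also be limited to a subset of the available evaluations; the names must match entries in `ALL_EVALS`:

```python
# Run only the Question/Answer evaluation
evaluator = RAGEvaluator()
evaluator.set_evaluations(eval_list=["qa_eval"])
evaluator.run_evals()
```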