feat: add LLM as judge evaluations (#960)
* add claude llm judge model
* add correctness and annotation relevancy metrics
* update root README and evals README
* add QA eval runner
* make eval runners more flexible with env vars
jalling97 authored Sep 17, 2024
1 parent 2d9cff2 commit 3e5f1e0
Showing 13 changed files with 689 additions and 45 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -15,6 +15,7 @@
- [UI](#ui)
- [Backends](#backends)
- [Repeater](#repeater)
- [Evaluations](#evaluations)
- [Usage](#usage)
- [Local Development](#local-development)
- [Contributing](#contributing)
@@ -58,6 +59,7 @@ The LeapfrogAI repository follows a monorepo structure based around an [API](#ap
leapfrogai/
├── src/
│ ├── leapfrogai_api/ # source code for the API
│ ├── leapfrogai_evals/ # source code for the LeapfrogAI evaluation framework
│ ├── leapfrogai_sdk/ # source code for the SDK
│ └── leapfrogai_ui/ # source code for the UI
├── packages/
@@ -115,6 +117,10 @@ LeapfrogAI provides several backends for a variety of use cases. Below is the ba

The [repeater](packages/repeater/) "model" is a basic "backend" that parrots all inputs it receives back to the user. It is built in the same way as the actual backends and is primarily used for testing the API.

### Evaluations

LeapfrogAI comes with an evaluation framework that is integrated with [DeepEval](https://docs.confident-ai.com/). For more information on running and utilizing evaluations in LeapfrogAI, please see the [Evals README](/src/leapfrogai_evals/README.md).

### Flavors

Each component has different images and values that refer to a specific image registry and/or hardening source. These images are packaged using [Zarf Flavors](https://docs.zarf.dev/ref/examples/package-flavors/):
28 changes: 28 additions & 0 deletions src/leapfrogai_evals/.env.example
@@ -0,0 +1,28 @@
LEAPFROGAI_API_URL="https://leapfrogai-api.uds.dev/openai/v1"
LEAPFROGAI_API_KEY="lfai-api-key"
ANTHROPIC_API_KEY="anthropic-api-key"

# ---- hyperparameters ----
# general
MODEL_TO_EVALUATE=vllm
TEMPERATURE=0.1
LLM_JUDGE=ClaudeSonnet

# Needle in a Haystack
NIAH_DATASET=defenseunicorns/LFAI_RAG_niah_v1
NIAH_ADD_PADDING=True
NIAH_MESSAGE_PROMPT="What is the secret code?"
NIAH_INSTRUCTION_TEMPLATE=DEFAULT_INSTRUCTION_TEMPLATE # either the name of a predefined instruction-template global or a literal prompt string
NIAH_MIN_DOC_LENGTH=4096
NIAH_MAX_DOC_LENGTH=4096
NIAH_MIN_DEPTH=0.0
NIAH_MAX_DEPTH=1.0
NIAH_NUM_COPIES=2

# Question & Answering
QA_DATASET=defenseunicorns/LFAI_RAG_qa_v1
QA_INSTRUCTION_TEMPLATE=DEFAULT_INSTRUCTION_TEMPLATE # either the name of a predefined instruction-template global or a literal prompt string
QA_NUM_SAMPLES=25
QA_NUM_DOCUMENTS=5
#QA_VECTOR_STORE_ID= # set this to a vector store ID to reuse an existing vector store that already contains the files
QA_CLEANUP_VECTOR_STORE=True # recommended: set this to False if a vector store ID is provided
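
These values are read as ordinary environment variables once `.env` is loaded; the sketch below shows one assumed way a runner could consume them (the variable names come from this file, but the parsing itself is illustrative, not the actual runner code):

```python
# Sketch only: read eval hyperparameters from the environment after loading .env.
# The actual NIAH/QA runners may parse and default these values differently.
import os

from dotenv import load_dotenv

load_dotenv()  # pull the values from .env into os.environ

model = os.environ.get("MODEL_TO_EVALUATE", "vllm")
temperature = float(os.environ.get("TEMPERATURE", "0.1"))
niah_depth_range = (
    float(os.environ.get("NIAH_MIN_DEPTH", "0.0")),
    float(os.environ.get("NIAH_MAX_DEPTH", "1.0")),
)
qa_num_samples = int(os.environ.get("QA_NUM_SAMPLES", "25"))
qa_cleanup = os.environ.get("QA_CLEANUP_VECTOR_STORE", "True").lower() == "true"
```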
50 changes: 48 additions & 2 deletions src/leapfrogai_evals/README.md
@@ -7,13 +7,20 @@ The LeapfrogAI RAG evaluation system assumes the following:

- LeapfrogAI is deployed
- A valid LeapfrogAI API key is set (for more info, see the [API README](/src/leapfrogai_api/README.md))
- For all LLM-enabled metrics, a valid Anthropic API key is set

Set the following environment variables:
For the easiest setup, copy the `.env.example` file:

```bash
cp .env.example .env
```

Within `.env`, fill in the required environment variables:

```bash
LEAPFROGAI_API_URL=<LeapfrogAI API url, usually: https://leapfrogai-api.uds.dev/openai/v1 for development>
LEAPFROGAI_API_KEY=<LeapfrogAI API key>
MODEL_TO_EVALUATE="vllm" # can also be provided as "model" to the __init__ for the runner
ANTHROPIC_API_KEY=<Anthropic API key>
```

Running `main.py` will by default run all of the evaluations currently available:
@@ -24,6 +31,45 @@ python -m pip install .
python main.py
```

## Question/Answer Evaluation

Question and answer pairs are a valuable setup for evaluating LLM systems as a whole. Within LeapfrogAI, this type of evaluation takes an input question, expected context, and expected output, and compares them to the retrieved context from RAG and the system's final output.

### Data
The LeapfrogAI QA evaluation uses a custom dataset available on HuggingFace: [defenseunicorns/LFAI_RAG_qa_v1](https://huggingface.co/datasets/defenseunicorns/LFAI_RAG_qa_v1)

LFAI_RAG_qa_v1 contains 36 question/answer/context entries that are intended to be used for LLM-as-a-judge enabled RAG Evaluations.

Example:

```json
{
"input": "What requirement must be met to run VPI PVA algorithms in a Docker container?",
"actual_output": null,
"expected_output": "To run VPI PVA algorithms in a Docker container, the same VPI version must be installed on the Docker host.",
"context": [
"2.6.\nCompute\nStack\nThe\nfollowing\nDeep\nLearning-related\nissues\nare\nnoted\nin\nthis\nrelease.\nIssue\nDescription\n4564075\nTo\nrun\nVPI\nPVA\nalgorithms\nin\na\ndocker\ncontainer,\nthe\nsame\nVPI\nversion\nhas\nto\nbe\ninstalled\non \nthe\ndocker\nhost.\n2.7.\nDeepstream\nIssue\nDescription\n4325898\nThe\npipeline\ngets\nstuck\nfor\nmulti\u0000lesrc\nwhen\nusing\nnvv4l2decoder.\nDS\ndevelopers\nuse \nthe\npipeline\nto\nrun\ndecode\nand\ninfer\njpeg\nimages.\nNVIDIA\nJetson\nLinux\nRelease\nNotes\nRN_10698-r36.3\n|\n11"
],
"source_file": "documents/Jetson_Linux_Release_Notes_r36.3.pdf"
}
```
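
For reference, the dataset can be pulled directly with the Hugging Face `datasets` library; this short snippet is illustrative only and is not part of the QA runner itself:

```python
from datasets import load_dataset

# Load the QA dataset referenced above and peek at one entry
dataset = load_dataset("defenseunicorns/LFAI_RAG_qa_v1")
split = next(iter(dataset))  # use whichever split the dataset ships with
row = dataset[split][0]
print(row["input"], "->", row["expected_output"])
```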

### Experimental Design
The LeapfrogAI QA evaluation uses the following process:

- build a vector store and upload the contextual documents from the QA dataset
- for each row in the dataset:
  - create an assistant
  - prompt the LLM to answer the input question using the contextual documents
  - record the following:
    - the model response
    - the retrieved context from RAG
  - delete the assistant
- delete the contextual documents
- delete the vector store

Various metrics can then be calculated using these individual pieces.
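
The sketch below shows roughly how this flow maps onto the OpenAI-compatible LeapfrogAI API using the `openai` client. It is a minimal illustration, not the actual `QA_Runner` implementation: the document list is a placeholder, and the real runner may use different calls, options, and error handling.

```python
import os

import openai

client = openai.OpenAI(
    base_url=os.environ["LEAPFROGAI_API_URL"],
    api_key=os.environ["LEAPFROGAI_API_KEY"],
)

# 1. Build a vector store and upload the contextual documents (placeholder paths)
vector_store = client.beta.vector_stores.create(name="qa-eval-docs")
doc_paths = ["documents/Jetson_Linux_Release_Notes_r36.3.pdf"]  # hypothetical list
file_ids = []
for path in doc_paths:
    with open(path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="assistants")
    file_ids.append(uploaded.id)
    client.beta.vector_stores.files.create(
        vector_store_id=vector_store.id, file_id=uploaded.id
    )

# 2. Create an assistant backed by the vector store and ask one question
assistant = client.beta.assistants.create(
    name="qa-eval-assistant",
    instructions="Answer the question using the provided documents.",
    model=os.environ.get("MODEL_TO_EVALUATE", "vllm"),
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What requirement must be met to run VPI PVA algorithms in a Docker container?",
)
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)

# 3. Record the model response and the RAG annotations (retrieved context)
message = client.beta.threads.messages.list(thread_id=thread.id).data[0]
actual_output = message.content[0].text.value
actual_annotations = message.content[0].text.annotations

# 4. Clean up the assistant, the uploaded documents, and the vector store
client.beta.assistants.delete(assistant.id)
for file_id in file_ids:
    client.files.delete(file_id)
client.beta.vector_stores.delete(vector_store.id)
```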

## Needle in a Haystack (NIAH)

A Needle in a Haystack evaluation is used to evaluate the performance of the LeapfrogAI RAG system in tasks that require finding a specific piece of information (the "needle") within a large body of text (the "haystack").
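
As a toy illustration of the idea (not the actual `NIAH_Runner` logic), a needle can be planted at a fractional depth inside padded filler text, mirroring the `NIAH_MIN_DEPTH`/`NIAH_MAX_DEPTH` and document-length settings in `.env.example`; lengths are treated as characters here purely for simplicity:

```python
def build_haystack(needle: str, filler: str, doc_length: int, depth: float) -> str:
    """Pad filler text to doc_length characters and insert the needle at a 0.0-1.0 depth."""
    haystack = (filler * (doc_length // len(filler) + 1))[:doc_length]
    insert_at = int(depth * len(haystack))
    return haystack[:insert_at] + " " + needle + " " + haystack[insert_at:]


# Hypothetical needle and filler values, for illustration only
doc = build_haystack(
    needle="The secret code is 713.",
    filler="The quick brown fox jumps over the lazy dog. ",
    doc_length=4096,
    depth=0.5,  # place the needle halfway through the document
)
```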
Empty file.
55 changes: 55 additions & 0 deletions src/leapfrogai_evals/judges/claude_sonnet.py
@@ -0,0 +1,55 @@
import asyncio
import functools
import os
from typing import Optional

import instructor
from anthropic import Anthropic
from deepeval.models.base_model import DeepEvalBaseLLM
from pydantic import BaseModel


class ClaudeSonnet(DeepEvalBaseLLM):
    """A DeepEval LLM class that uses the Anthropic API to serve Claude models as evaluation judges"""

    def __init__(
        self, api_key: Optional[str] = None, model: str = "claude-3-5-sonnet-20240620"
    ):
        self.model = model
        self.client = Anthropic(api_key=api_key or os.environ.get("ANTHROPIC_API_KEY"))

    def load_model(self):
        """Returns the currently selected model"""
        return self.model

    def generate(
        self,
        prompt: str,
        schema: BaseModel,
        max_tokens: int = 1024,
    ) -> BaseModel:
        """Generates a schema-validated response from the Anthropic API"""
        instructor_client = instructor.from_anthropic(self.client)
        response = instructor_client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return response

    async def a_generate(
        self, prompt: str, schema: BaseModel, *args, **kwargs
    ) -> BaseModel:
        """Async implementation of the generate function"""
        loop = asyncio.get_running_loop()
        # functools.partial is used because run_in_executor does not forward keyword arguments
        return await loop.run_in_executor(
            None, functools.partial(self.generate, prompt, schema, *args, **kwargs)
        )

    def get_model_name(self):
        return f"Anthropic {self.model}"
138 changes: 120 additions & 18 deletions src/leapfrogai_evals/main.py
@@ -1,40 +1,73 @@
import deepeval
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

import logging
import numpy as np
import os
from dotenv import load_dotenv
import time
from typing import Optional, List

from leapfrogai_evals.runners.niah_runner import NIAH_Runner
from leapfrogai_evals.judges.claude_sonnet import ClaudeSonnet # noqa
from leapfrogai_evals.metrics.annotation_relevancy import AnnotationRelevancyMetric
from leapfrogai_evals.metrics.correctness import CorrectnessMetric
from leapfrogai_evals.metrics.niah_metrics import NIAH_Retrieval, NIAH_Response
from leapfrogai_evals.runners.niah_runner import NIAH_Runner
from leapfrogai_evals.runners.qa_runner import QA_Runner

ALL_EVALS = ["LFAI_NIAH"]
ALL_EVALS = ["niah_eval", "qa_eval"]


class RAGEvaluator:
"""A class that handles running all of the LeapfrogAI RAG evaluations"""

def __init__(self):
self.eval_list = None
def __init__(
self,
eval_list: Optional[List[str]] = None,
):
self.eval_list = eval_list
self.test_case_dict = None
self.niah_test_cases = None
self.eval_options = ALL_EVALS
self.eval_results = dict()

def set_evaluations(self, evals_list=[]) -> None:
def set_evaluations(self, eval_list: List[str] = None) -> None:
"""Set the evaluations that will be run via a list"""
if len(evals_list) == 0:
if not eval_list:
logging.info("Setting eval list to ALL")
self.eval_list = ALL_EVALS
# TODO: Add other evals options
else:
for item in eval_list:
if item not in ALL_EVALS:
raise AttributeError(
f"'{item}' is not an available evaluation. Please limit the list to one of the following: {ALL_EVALS}"
)
self.eval_list = eval_list

def run_evals(self, *args, **kwargs) -> None:
"""Run all of the selected evaluations"""
if self.eval_list is None:
raise AttributeError(
"the list of evaluations has not been set. Please do so by running the 'set_evaluations()' function"
)

logging.info("Running the following evaluations:")
for eval in self.eval_list:
logging.info(f" -{eval}")
if "LFAI_NIAH" in self.eval_list:
self._niah_evaluation(*args, **kwargs)
# TODO: add more evaluations
logging.info("".join([f"\n - {eval_name}" for eval_name in self.eval_list]))

def _niah_evaluation(self, *args, **kwargs) -> None:
start_time = time.time()
for eval_name in self.eval_list:
eval = getattr(self, eval_name)
eval(*args, **kwargs)
end_time = time.time()

self.eval_results["Eval Execution Runtime (seconds)"] = end_time - start_time

logging.info("\n\nFinal Results:")
for key, value in self.eval_results.items():
logging.info(f"{key}: {value}")

def niah_eval(self, *args, **kwargs) -> None:
"""Run the Needle in a Haystack evaluation"""
logging.info("Beginning Needle in a Haystack Evaluation...")
self.niah_test_cases = []

niah_runner = NIAH_Runner(*args, **kwargs)
@@ -55,16 +88,85 @@ def _niah_evaluation(self, *args, **kwargs) -> None:
)

# run metrics
# TODO: Give ability to choose which metrics to run
retrieval_metric = NIAH_Retrieval()
response_metric = NIAH_Response()
metrics = [retrieval_metric, response_metric]

for metric in metrics:
scores = []
successes = []
for test_case in self.niah_test_cases:
metric.measure(test_case)
scores.append(metric.score)
successes.append(metric.is_successful())
self.eval_results[f"Average {metric.__name__}"] = np.mean(scores)
logging.info(f"{metric.__name__} Results:")
logging.info(f"average score: {np.mean(scores)}")
logging.info(f"scores: {scores}")
logging.info(f"successes: {successes}")

def qa_eval(self, *args, **kwargs) -> None:
"""Runs the Question/Answer evaluation"""
logging.info("Beginning Question/Answer Evaluation...")
self.qa_test_cases = []

qa_runner = QA_Runner(*args, **kwargs)
qa_runner.run_experiment()

# build test cases out of the qa_dataset
for row in qa_runner.qa_data:
self.qa_test_cases.append(
LLMTestCase(
input=row["input"],
actual_output=row["actual_output"],
context=row["context"],
expected_output=row["expected_output"],
additional_metadata={
"actual_annotations": row["actual_annotations"],
"expected_annotations": row["expected_annotations"],
},
# retrieval_context = row['retrieval_context'] # TODO: add this for more metrics
)
)

# Create judge llm
try:
judge_model = globals()[os.environ.get("LLM_JUDGE")]()
except KeyError:
judge_model = os.environ.get("LLM_JUDGE")

# run metrics
# TODO: Give ability to choose which metrics to run
correctness_metric = CorrectnessMetric(model=judge_model)
answer_relevancy_metric = AnswerRelevancyMetric(model=judge_model)
annotation_relevancy_metric = AnnotationRelevancyMetric()
metrics = [
correctness_metric,
answer_relevancy_metric,
annotation_relevancy_metric,
]

deepeval.evaluate(
test_cases=self.niah_test_cases, metrics=[retrieval_metric, response_metric]
)
for metric in metrics:
scores = []
successes = []
reasons = []
for test_case in self.qa_test_cases:
metric.measure(test_case)
scores.append(metric.score)
successes.append(metric.is_successful())
reasons.append(metric.reason)
self.eval_results[f"Average {metric.__name__}"] = np.mean(scores)
logging.info(f"{metric.__name__} Results:")
logging.info(f"average score: {np.mean(scores)}")
logging.info(f"scores: {scores}")
logging.info(f"successes: {successes}")
logging.info(f"reasons: {reasons}")


if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
load_dotenv()
evaluator = RAGEvaluator()
evaluator.set_evaluations()
evaluator.run_evals()
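
As a usage note, the evaluator can also be limited to a subset of the available evaluations; the names must match entries in `ALL_EVALS`:

```python
# Run only the Question/Answer evaluation
evaluator = RAGEvaluator()
evaluator.set_evaluations(eval_list=["qa_eval"])
evaluator.run_evals()
```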