
chore(llmobs): implement answer relevancy ragas metric #11738

Closed · wants to merge 21 commits

Conversation

lievan (Contributor) commented Dec 16, 2024

Implements answer relevancy metric for ragas integration.

About Answer Relevancy

The answer relevancy metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. The metric is computed using the question, the retrieved contexts, and the answer.

Answer relevancy is defined as the mean cosine similarity of the original question to a number of artificial questions, which were generated (reverse engineered) from the response.
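
As a rough illustration of that definition, the sketch below shows the mean cosine similarity computation. It is not the ragas implementation: the embed function and the artificial questions (which ragas generates with an LLM from the response) are placeholders.

    import numpy as np

    def answer_relevancy_score(original_question, generated_questions, embed):
        """Mean cosine similarity between the original question and the
        questions reverse engineered from the answer. `embed` stands in for
        whatever embedding model is configured."""
        q = np.asarray(embed(original_question), dtype=float)
        sims = []
        for generated in generated_questions:
            g = np.asarray(embed(generated), dtype=float)
            sims.append(np.dot(q, g) / (np.linalg.norm(q) * np.linalg.norm(g)))
        return float(np.mean(sims))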

Example trace

(screenshot of an example trace)

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@lievan lievan requested a review from a team as a code owner December 16, 2024 15:25

CODEOWNERS have been resolved as:

ddtrace/llmobs/_evaluators/ragas/answer_relevancy.py                    @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_answer_relevancy_evaluator.emits_traces_and_evaluations_on_exit.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_evaluators.test_ragas_answer_relevancy_emits_traces.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_evaluators.test_ragas_answer_relevancy_submits_evaluation.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_evaluators.test_ragas_answer_relevancy_submits_evaluation_on_span_with_custom_keys.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_evaluators.test_ragas_answer_relevancy_submits_evaluation_on_span_with_question_in_messages.yaml  @DataDog/ml-observability
ddtrace/llmobs/_evaluators/ragas/base.py                                @DataDog/ml-observability
ddtrace/llmobs/_evaluators/ragas/models.py                              @DataDog/ml-observability
ddtrace/llmobs/_evaluators/runner.py                                    @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_ragas_evaluators.py                            @DataDog/ml-observability

@datadog-dd-trace-py-rkomorn

Datadog Report

Branch report: evan.li/answer-relevancy
Commit report: 812a162
Test service: dd-trace-py

✅ 0 Failed, 558 Passed, 910 Skipped, 7m 20.85s Total duration (28m 50.51s time saved)

@lievan lievan changed the base branch from evan.li/remaining-ragas-metrics to main December 16, 2024 15:38
@lievan lievan changed the base branch from main to evan.li/remaining-ragas-metrics December 16, 2024 15:38
@pr-commenter

pr-commenter bot commented Dec 16, 2024

Benchmarks

Benchmark execution time: 2024-12-16 16:02:25

Comparing candidate commit 812a162 in PR branch evan.li/answer-relevancy with baseline commit 956e32d in branch evan.li/remaining-ragas-metrics.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

(Diff excerpt from the answer relevancy evaluator's __init__:)

Raises: NotImplementedError if the ragas library is not found or if ragas version is not supported.
"""
super().__init__(llmobs_service)
self.ragas_answer_relevancy_instance = self._get_answer_relevancy_instance()
Reviewer (Contributor) commented:

dumb question - what is the purpose of having a different LLM instance per eval metric runner? Are they not all references to the same base OpenAI() LLM?

lievan (Contributor, author) replied:

Each eval metric runner has access to one instance of a ragas metric (answer_relevancy, context_precision, faithfulness); each of these ragas metrics has a separate llm attribute. We maintain a reference to the ragas metric, not the llm attribute.

but yeah, the default is OpenAI for all of them
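
To make that concrete, here is a hedged sketch of the arrangement: the evaluator holds a reference to one ragas metric instance, and it is that metric object (not the evaluator) which carries an llm attribute, defaulting to OpenAI. Only _get_answer_relevancy_instance and ragas_answer_relevancy_instance come from the diff above; the rest of the code is illustrative, not the actual PR implementation.

    from ddtrace.llmobs._evaluators.ragas.base import BaseRagasEvaluator  # assumed import path

    class RagasAnswerRelevancyEvaluator(BaseRagasEvaluator):
        def __init__(self, llmobs_service=None):
            """Raises NotImplementedError if the ragas library is not found
            or the installed ragas version is not supported."""
            super().__init__(llmobs_service)
            self.ragas_answer_relevancy_instance = self._get_answer_relevancy_instance()

        def _get_answer_relevancy_instance(self):
            try:
                # ragas exposes answer_relevancy as a ready-made metric object
                from ragas.metrics import answer_relevancy
            except ImportError:
                raise NotImplementedError("ragas is required for the answer relevancy evaluator")
            # The evaluator keeps the metric; the metric owns its own `llm`
            # attribute, which ragas defaults to an OpenAI-backed LLM.
            return answer_relevancy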

Comment on lines +103 to +105
question = cp_inputs["question"]
contexts = cp_inputs["contexts"]
answer = cp_inputs["answer"]
Reviewer (Contributor) commented:

Same comment as before: it doesn't seem necessary to explicitly separate these into new variables, since each is used at most once, immediately afterwards.

ddtrace/llmobs/_evaluators/ragas/base.py: 3 resolved (outdated) review threads
@@ -52,7 +54,7 @@ def __init__(self, interval: float, llmobs_service=None, evaluators=None):
 if evaluator in SUPPORTED_EVALUATORS:
     evaluator_init_state = "ok"
     try:
-        self.evaluators.append(SUPPORTED_EVALUATORS[evaluator](llmobs_service=llmobs_service))
+        self.evaluators.append(SUPPORTED_EVALUATORS[evaluator](llmobs_service=llmobs_service))  # noqa: E501
Reviewer (Contributor) commented:

Why is this fmt comment required?

lievan (Contributor, author) replied:

There was an error being raised by mypy due to the use of ABC. I think it was a bug, since it only appeared when we had 3 or more evaluators (see python/mypy#13044). But we are not using an abstract class for BaseRagasEvaluator anyway, so no need for this anymore.
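
For reference, a self-contained sketch of the registry pattern that diff line lives in: a dict maps evaluator names to evaluator classes, and the runner instantiates them by name. Apart from SUPPORTED_EVALUATORS and BaseRagasEvaluator, the names below are illustrative placeholders rather than dd-trace-py code.

    from typing import Dict, Type

    class BaseRagasEvaluator:  # a plain base class, not an ABC
        def __init__(self, llmobs_service=None):
            self.llmobs_service = llmobs_service

    class RagasAnswerRelevancyEvaluator(BaseRagasEvaluator):
        pass

    SUPPORTED_EVALUATORS: Dict[str, Type[BaseRagasEvaluator]] = {
        "ragas_answer_relevancy": RagasAnswerRelevancyEvaluator,
    }

    def build_evaluators(names, llmobs_service=None):
        evaluators = []
        for name in names:
            if name in SUPPORTED_EVALUATORS:
                # The author reports mypy flagged this instantiation when the
                # base class was an ABC (python/mypy#13044); with a concrete
                # base class no suppression comment is needed.
                evaluators.append(SUPPORTED_EVALUATORS[name](llmobs_service=llmobs_service))
        return evaluators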

@lievan lievan requested a review from a team as a code owner January 13, 2025 17:15
@lievan lievan requested review from a team as code owners January 13, 2025 17:15
@lievan lievan requested review from ZStriker19, erikayasuda, nikita-tkachenko-datadog and wconti27 and removed request for a team January 13, 2025 17:15
@lievan lievan changed the base branch from evan.li/remaining-ragas-metrics to main January 13, 2025 17:15
@lievan lievan requested review from a team and removed request for a team, ZStriker19, erikayasuda, nikita-tkachenko-datadog and wconti27 January 13, 2025 17:16
lievan (Contributor, author) commented Jan 13, 2025

Going to re-open another PR (#11915) with latest changes from @main

@lievan lievan closed this Jan 13, 2025
Labels
None yet
Projects
None yet

3 participants