
Commit

Clean up of dev workflow and fundamentals
epec254 committed Jun 8, 2024
1 parent 7e1be4f commit 7894718
Showing 5 changed files with 11 additions and 13 deletions.
@@ -13,7 +13,7 @@ During data preparation, the RAG application's data pipeline takes raw unstructu

In the remainder of this section, we describe the process of preparing unstructured data for retrieval using *semantic search*. Semantic search understands the contextual meaning and intent of a user query to provide more relevant search results.

Semantic search is one of several approaches that can be taken when implementing the retrieval component of a RAG application over unstructured data. We cover alternate retrieval strategies in the [retrieval deep dive section](/nbs/3-deep-dive).
Semantic search is one of several approaches that can be taken when implementing the retrieval component of a RAG application over unstructured data. We cover alternate retrieval strategies in the [retrieval knobs section](/nbs/3-deep-dive).
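For a rough sense of how semantic-search retrieval works in practice — a minimal sketch, not the cookbook's implementation; the model name, chunk texts, and query below are placeholders — embed the chunks and the query, then rank by cosine similarity:

```python
# Minimal sketch of semantic search over document chunks (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any embedding model works

chunks = [
    "Our premium plan includes 24/7 phone support.",
    "Invoices are emailed on the first business day of each month.",
    "Password resets can be done from the account settings page.",
]

# Embed chunks once at indexing time; embed the query at request time.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(["How do I reset my password?"], normalize_embeddings=True)[0]

# On normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vecs @ query_vec
for i in np.argsort(scores)[::-1][:2]:
    print(f"{scores[i]:.3f}  {chunks[i]}")
```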



5 changes: 3 additions & 2 deletions genai_cookbook/nbs/2-fundamentals-unstructured-eval.md
@@ -11,13 +11,14 @@ Evaluation and monitoring of Generative AI applications, including RAG, differs
| **Metrics** | Metrics evaluate the __inputs & outputs__ of the component e.g., feature drift, precision/recall, latency, etc <br/><br/> Since there is only one component, overall metrics == component metrics. | __Component metrics__ evaluate the __inputs & outputs__ of each component e.g., precision @ K, nDCG, latency, toxicity, etc <br/><br/>__Compound metrics__ evaluate how multiple components interact e.g., faithfulness measures the generator’s adherence to the knowledge from a retriever which requires the chain input, chain output, and output of the internal retriever<br/><br/>__Overall metrics__ evaluate the overall input & output of the system e.g., answer correctness, latency |
| **Evaluation** | Answer is __deterministically__ “right” or “wrong” <br/><br/> → __Deterministic metrics__ work | Answer is “right” or “wrong” but: <br/><ul><li>Many right answers (non deterministic)</li><li>Some right answers are more right</li></ul><br/>→ Need __human feedback__ to be confident<br/>→ Need __LLM-judged metrics__ to scale evaluation<br/> |
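To make the component-level retrieval metrics named in the table concrete, here is a minimal, library-agnostic sketch of precision@K and nDCG@K computed from binary relevance labels (the example labels are invented):

```python
# Sketch of two retrieval metrics from the table: precision@K and nDCG@K.
# `relevances` is the judged relevance (1/0) of each retrieved chunk, in ranked order.
import math

def precision_at_k(relevances: list[int], k: int) -> float:
    return sum(relevances[:k]) / k

def ndcg_at_k(relevances: list[int], k: int) -> float:
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = [1, 0, 1, 0, 0]              # relevance of the 5 retrieved chunks
print(precision_at_k(retrieved, 5))      # 0.4
print(round(ndcg_at_k(retrieved, 5), 3))
```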

Effectively evaluating and monitoring application quality, cost and latency requires several components:


```{image} ../images/2-fundamentals-unstructured/4_img.png
:align: center
```
<br/>

Effectively evaluating and monitoring application quality, cost and latency requires several components:

- **Evaluation set:** To rigorously evaluate your RAG application, you need a curated set of evaluation queries (and ideally outputs) that are representative of the application's intended use. These evaluation examples should be challenging, diverse, and updated to reflect changing usage and requirements.

- **Metric definitions**: You can't manage what you don't measure. In order to improve RAG quality, it is essential to define what quality means for your use case. Depending on the application, important metrics might include response accuracy, latency, cost, or ratings from key stakeholders. You'll need metrics that measure each component, how the components interact with each other, and the overall system.
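As a purely illustrative sketch of metric definitions at the three levels described above (component, compound, overall) — the field names here are invented and not part of any Databricks API — a per-request metric record might look like:

```python
# Hypothetical per-request metric record; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    # Component metrics: inputs & outputs of a single component.
    retrieval_precision_at_5: float
    retrieval_latency_ms: float
    generation_latency_ms: float
    # Compound metric: how components interact (generator vs. retrieved context).
    faithfulness: float
    # Overall metrics: the end-to-end input & output.
    answer_correct: bool
    total_latency_ms: float
    total_cost_usd: float

m = RequestMetrics(
    retrieval_precision_at_5=0.6,
    retrieval_latency_ms=120.0,
    generation_latency_ms=900.0,
    faithfulness=0.9,
    answer_correct=True,
    total_latency_ms=1050.0,
    total_cost_usd=0.004,
)
```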
3 changes: 1 addition & 2 deletions genai_cookbook/nbs/2-fundamentals-unstructured.md
@@ -16,5 +16,4 @@ This section will introduce the key components and principles behind developing
:alt: Major components of RAG over unstructured data
:align: center
```

The [next section](/nbs/3-deep-dive) of this guide will unpack the finer details of the typical components that make up the data pipeline and RAG chain of a RAG application using unstructured data.
<br/>
2 changes: 1 addition & 1 deletion genai_cookbook/nbs/3-deep-dive-chain.md
@@ -50,7 +50,7 @@ Using the user query directly as a retrieval query can work for some queries. Ho
```{eval-rst}
.. note::
Filter extraction must be done in conjunction with changes to both metadata extraction [data pipeline] and retrieval [RAG chain] components. The metadata extraction step should ensure that the relevant metadata fields are available for each document/chunk, and the retrieval step should be implemented to accept and apply extracted filters.
Filter extraction must be done in conjunction with changes to both metadata extraction [data pipeline](./3-deep-dive-data-pipeline.md) and [retriever chain](#retrieval) components. The metadata extraction step should ensure that the relevant metadata fields are available for each document/chunk, and the retrieval step should be implemented to accept and apply extracted filters.
.. include:: ./include-rst.rst
```
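As a toy illustration of the coupling the note describes — real chains typically use an LLM for filter extraction and a vector index's native filter syntax rather than this in-memory approach — the sketch below pulls a year filter out of the query and applies it to chunk metadata before ranking:

```python
# Sketch only: extract a metadata filter from the query, then restrict retrieval to
# chunks whose metadata matches. The chunk data and filter logic are illustrative.
import re

chunks = [
    {"text": "FY2023 support policy ...", "metadata": {"year": 2023}},
    {"text": "FY2024 support policy ...", "metadata": {"year": 2024}},
]

def extract_filters(query: str) -> dict:
    match = re.search(r"\b(20\d{2})\b", query)
    return {"year": int(match.group(1))} if match else {}

def retrieve(query: str, chunks: list[dict], k: int = 3) -> list[dict]:
    filters = extract_filters(query)
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(field) == value for field, value in filters.items())
    ]
    # A real retriever would rank `candidates` by semantic similarity here.
    return candidates[:k]

print(retrieve("What was the support policy in 2024?", chunks))
```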
12 changes: 5 additions & 7 deletions genai_cookbook/nbs/5-rag-development-workflow.md
@@ -14,18 +14,16 @@ This section walks you through Databricks recommended development workflow for b
```{image} ../images/5-hands-on/1_img.png
:align: center
```

Mapping to this workflow, this section provides ready-to-run sample code for every step and every suggestion to improve quality.

Throughout, we will demonstrate evaluation-driven development using one of Databricks' internal generative AI use cases: using a RAG bot to help answer customer support questions in order to [1] reduce support costs [2] improve the customer experience.
<br/>
The [implement](./5-hands-on-requirements.md) section of this cookbook provides a guided implementation of this workflow with sample code.

There are two core concepts in **evaluation-driven development:**

1. **Metrics:** Defining high-quality
1. [**Metrics:**](./4-evaluation-metrics.md) Defining what high-quality means

*Similar to how you set business goals each year, you need to define what high-quality means for your use case. Databricks' Quality Lab provides a suggested set of N metrics to use, the most important of which is answer accuracy or correctness - is the RAG application providing the right answer?*

2. **Evaluation:** Objectively measuring the metrics
2. [**Evaluation set:**](./4-evaluation-eval-sets.md) Objectively measuring the metrics

*To objectively measure quality, you need an evaluation set, which contains questions with known-good answers validated by humans. While this may seem scary at first - you probably don't have an evaluation set sitting ready to go - this guide walks you through the process of developing and iteratively refining this evaluation set.*
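A rough sketch of what such an evaluation set and an overall answer-correctness metric can look like (the schema is invented for illustration, and exact-match grading stands in for the LLM judge or human review you would use in practice):

```python
# Illustrative evaluation set and accuracy calculation; schema and grading are placeholders.
eval_set = [
    {"question": "How do I reset my password?",
     "expected_answer": "Use the reset link on the account settings page."},
    {"question": "When are invoices sent?",
     "expected_answer": "On the first business day of each month."},
]

def my_rag_app(question: str) -> str:
    # Placeholder for the real RAG chain being evaluated.
    return "Use the reset link on the account settings page."

def is_correct(generated: str, expected: str) -> bool:
    # Exact match stands in for an LLM judge here.
    return generated.strip().lower() == expected.strip().lower()

results = [is_correct(my_rag_app(ex["question"]), ex["expected_answer"]) for ex in eval_set]
print(f"answer correctness: {sum(results) / len(results):.0%}")  # 50% for this toy run
```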

@@ -35,4 +33,4 @@ Anchoring against metrics and an evaluation set provides the following benefits:

2. Getting alignment with business stakeholders on the readiness of the application for production becomes more straightforward when you can confidently state, *"we know our application answers the most critical questions to our business correctly and doesn't hallucinate."*

*>> Evaluation-driven development is known in the academic research community as "hill climbing" akin to climbing a hill to reach the peak - where the hill is your metric and the peak is 100% accuracy on your evaluation set.*
> Evaluation-driven development is known in the academic research community as ["hill climbing"](https://en.wikipedia.org/wiki/Hill_climbing) akin to climbing a hill to reach the peak - where the hill is your metric and the peak is 100% accuracy on your evaluation set.
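A toy sketch of that hill-climbing loop (the configuration knobs and the `evaluate` placeholder are invented for illustration): score each candidate configuration against the evaluation set and keep whichever does best on the metric.

```python
# Toy evaluation-driven "hill climbing": try candidate configurations, keep the best
# scorer on the evaluation set. The knobs and `evaluate` body are illustrative only.
candidates = [
    {"chunk_size": 512, "top_k": 3},
    {"chunk_size": 1024, "top_k": 5},
    {"chunk_size": 256, "top_k": 10},
]

def evaluate(config: dict) -> float:
    # Placeholder: run the RAG chain with `config` over the evaluation set and
    # return the fraction of correct answers. A fake formula stands in here.
    return 0.5 + 0.0001 * config["chunk_size"] / config["top_k"]

best = max(candidates, key=evaluate)
print("best config so far:", best, "score:", round(evaluate(best), 3))
```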
