Commit

Push all changes from staging repo

epec254 committed Jun 10, 2024
1 parent 6cacd1f commit 08544cb
Showing 151 changed files with 3,422 additions and 5,222 deletions.
2 changes: 2 additions & 0 deletions dev_requirements.txt
@@ -0,0 +1,2 @@
jupyter-book
livereload
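These two dependencies back a local-preview loop for the book. As a minimal sketch of how they combine (the script name and watch patterns are assumptions, not part of this commit), livereload can rebuild the Jupyter Book whenever a source page or the TOC changes, then refresh the browser:

```python
# serve.py (hypothetical helper script): live-rebuild the book while editing.
from livereload import Server, shell

# Rebuild command supplied by the jupyter-book dependency above.
build = shell("jupyter-book build genai_cookbook/")

server = Server()
# Rebuild whenever content pages or the table of contents change.
server.watch("genai_cookbook/*.md", build)
server.watch("genai_cookbook/nbs/*.md", build)
server.watch("genai_cookbook/_toc.yml", build)
# Serve Jupyter Book's default HTML output directory.
server.serve(root="genai_cookbook/_build/html")
```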
19 changes: 12 additions & 7 deletions genai_cookbook/_toc.yml
@@ -4,6 +4,9 @@
 format: jb-book
 root: index
 parts:
+- caption: "Overview"
+  chapters:
+  - file: index-2
 - caption: "Learn"
   numbered: true
   chapters:
@@ -25,7 +28,7 @@ parts:
   - file: nbs/4-evaluation-infra
   - file: nbs/5-rag-development-workflow
 - caption: "Implement"
-  numbered: true
+  numbered: false
   chapters:
   - file: nbs/5-hands-on-requirements
   - file: nbs/6-implement-overview
@@ -41,13 +44,15 @@
 # - caption: "Build: Debug & iterate on RAG quality"
 #   numbered: true
 #   chapters:
-  - file: nbs/5-hands-on-improve-quality
+  # - file: nbs/5-hands-on-improve-quality
 #   sections:
-  - file: nbs/5-hands-on-improve-quality-step-1
-    sections:
-    - file: nbs/5-hands-on-improve-quality-step-1-retrieval
-    - file: nbs/5-hands-on-improve-quality-step-1-generation
-  - file: nbs/5-hands-on-improve-quality-step-2
+  sections:
+  - file: nbs/5-hands-on-improve-quality-step-1
+    sections:
+    - file: nbs/5-hands-on-improve-quality-step-1-retrieval
+    - file: nbs/5-hands-on-improve-quality-step-1-generation
+  - file: nbs/5-hands-on-improve-quality-step-2
+  - file: nbs/5-hands-on-improve-quality-step-2-data-pipeline
 - file: nbs/5-hands-on-deploy-and-monitor
 # - caption: "Deploy & monitor a RAG app"
 #   chapters:
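For readers unfamiliar with Jupyter Book TOCs, the reorganization above is easier to see by flattening the parts/chapters/sections tree. A minimal sketch, assuming PyYAML is installed (jupyter-book does the real parsing with far more validation):

```python
import yaml

def walk(entries, depth=0):
    """Yield (depth, file) pairs from a list of chapter/section entries."""
    for entry in entries:
        if "file" in entry:
            yield depth, entry["file"]
        # Nested sections render as sub-pages of the enclosing file.
        yield from walk(entry.get("sections", []), depth + 1)

with open("genai_cookbook/_toc.yml") as f:
    toc = yaml.safe_load(f)

for part in toc.get("parts", []):
    print(part.get("caption", "(untitled part)"))
    for depth, page in walk(part.get("chapters", [])):
        print("  " * (depth + 1) + page)
```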
Binary file modified genai_cookbook/images/.DS_Store
Binary file not shown.
Binary file modified genai_cookbook/images/5-hands-on/1_img.png
Binary file added genai_cookbook/images/5-hands-on/fail.png
Binary file added genai_cookbook/images/5-hands-on/pass.png
Binary file added genai_cookbook/images/5-hands-on/workflow.png
Binary file added genai_cookbook/images/5-hands-on/workflow_poc.png
File renamed without changes
80 changes: 80 additions & 0 deletions genai_cookbook/index-2.md
@@ -0,0 +1,80 @@
---
title: Databricks Generative AI Cookbook
---

# Databricks Generative AI Cookbook

**TL;DR:** This cookbook and its sample code will take you from an initial POC to a high-quality, production-ready application using [Mosaic AI Quality Lab](https://docs.databricks.com/generative-ai/agent-evaluation/index.html) and [Mosaic AI Agent Framework](https://docs.databricks.com/generative-ai/retrieval-augmented-generation.html) on the Databricks platform.

The Databricks Generative AI Cookbook is a definitive how-to guide for building *high-quality* generative AI applications. *High-quality* applications are those that:
1. **Accurate:** provide correct responses
2. **Safe:** do not deliver harmful or insecure responses
3. **Governed:** respect data permissions & access controls and track lineage

Developed in partnership with Mosaic AI's research team, this cookbook lays out Databricks' best-practice development workflow for building high-quality RAG apps: *evaluation-driven development*. It outlines the most relevant knobs & approaches that can increase RAG application quality and provides a comprehensive repository of sample code implementing those techniques.

```{important}
- Only have 10 minutes and want to see a demo of Mosaic AI Agent Framework & Quality Lab? Start [here](https://DBDEMO).
- Want to hop into code and deploy a RAG POC using your data? Start [here](./nbs/6-implement-overview.md).
- Don't have any data, but want to deploy a sample RAG application? Start here.
```

```{image} images/index/dbxquality.png
:align: center
```

<br/>


```{image} images/5-hands-on/review_app2.gif
:align: center
```

<br/>

This cookbook is intended for use with the Databricks platform. Specifically:
- [Mosaic AI Agent Framework](https://docs.databricks.com/generative-ai/retrieval-augmented-generation.html), which provides a fast developer workflow with enterprise-ready LLMOps & governance
- [Mosaic AI Quality Lab](https://docs.databricks.com/generative-ai/agent-evaluation/index.html), which provides reliable quality measurement using proprietary AI-assisted LLM judges, calibrated with human feedback collected through an intuitive web-based chat UI


# Retrieval-augmented generation (RAG)

> This first release focuses on retrieval-augmented generation (RAG). Future releases will cover other popular generative AI techniques: agents & function calling, prompt engineering, fine-tuning, and pre-training.

The RAG cookbook is divided into 2 sections:
1. [**Learn:**](#learn) Understand the required components of a production-ready, high-quality RAG application
2. [**Implement:**](#implement) Use our sample code to follow an evaluation-driven workflow for delivering a high-quality RAG application

## Code-based quick starts

| Time required | Outcome | Link |
|------ | ---- | ---- |
| 🕧 <br/> 10 minutes | Sample RAG app deployed to a web-based chat app that collects feedback | [RAG Demo](https://DBDEMO) |
| 🕧🕧🕧 <br/> 60 minutes | POC RAG app with *your data* deployed to a chat UI that can collect feedback from your business stakeholders | [Build & deploy a POC](./nbs/5-hands-on-build-poc.md) |
| 🕧🕧 <br/> 30 minutes | Comprehensive quality/cost/latency evaluation of your POC app | - [Evaluate your POC](./nbs/5-hands-on-evaluate-poc.md) <br/> - [Identify the root causes of quality issues](./nbs/5-hands-on-improve-quality-step-1.md) |



## Table of contents
<!--
**Table of contents**
1. [RAG overview](./nbs/1-introduction-to-rag): Understand how RAG works at a high-level
2. [RAG fundamentals](./nbs/2-fundamentals-unstructured): Understand the key components in a RAG app
3. [RAG quality knobs](./nbs/3-deep-dive): Understand the knobs Databricks recommends tuning to improve RAG app quality
4. [RAG quality evaluation deep dive](./nbs/4-evaluation): Understand how RAG evaluation works, including creating evaluation sets, the quality metrics that matter, and required developer tooling
5. [Evaluation-driven development](nbs/5-rag-development-workflow.md): Understand the Databricks-recommended development workflow for building, testing, and deploying a high-quality RAG application: evaluation-driven development-->

```{tableofcontents}
```
<!--
#### Implement
**Table of contents**
1. [Gather Requirements](./nbs/5-hands-on-requirements.md): Requirements you must discover from stakeholders before building a RAG app
2. [Deploy POC to Collect Stakeholder Feedback](./nbs/5-hands-on-build-poc.md): Launch a proof of concept (POC) to gather feedback from stakeholders and understand baseline quality
3. [Evaluate POC’s Quality](./nbs/5-hands-on-evaluate-poc.md): Assess the quality of your POC to identify areas for improvement
4. [Root Cause & Iteratively Fix Quality Issues](./nbs/5-hands-on-improve-quality.md): Diagnose the root causes of any quality issues and apply iterative fixes to improve the app's quality
5. [Deploy & Monitor](./nbs/5-hands-on-deploy-and-monitor.md): Deploy the finalized RAG app to production and continuously monitor its performance to ensure sustained quality.
-->
81 changes: 51 additions & 30 deletions genai_cookbook/index.md
@@ -2,42 +2,71 @@
 title: Databricks Generative AI Cookbook
 ---
 
-# Databricks Mosaic Generative AI Cookbook
+# Databricks Generative AI Cookbook
 
-The Databricks Generative AI Cookbook is a definitive how-to guide for building *high-quality* generative AI applications. *High-quality* applications are:
-1. **Accurate:** provides correct responses
-2. **Safe:** does not deliver harmful or insecure responses
-3. **Governed:** respects permissions & access controls
+**TL;DR:** This cookbook and its sample code will take you from an initial POC to a high-quality, production-ready application using [Mosaic AI Quality Lab](https://docs.databricks.com/generative-ai/agent-evaluation/index.html) and [Mosaic AI Agent Framework](https://docs.databricks.com/generative-ai/retrieval-augmented-generation.html) on the Databricks platform.
 
-Developed in partnership with Mosaic AI's research team, this cookbook lays out Databricks best-practice development workflow for building high-quality RAG apps: *evaluation driven development.* It outlines the most relevant knobs & approaches that can increase quality and provides a comprehensive repository of sample code implementing those techniques. This code & cookbook will take you from initial POC to high-quality production-ready application.
+The Databricks Generative AI Cookbook is a definitive how-to guide for building *high-quality* generative AI applications. *High-quality* applications are those that:
+1. **Accurate:** provide correct responses
+2. **Safe:** do not deliver harmful or insecure responses
+3. **Governed:** respect data permissions & access controls and track lineage
 
-> This first release focuses on retrieval-augmented generation (RAG). Future releases will include the other popular generative AI techniques: agents & function calling, prompt engineering, fine tuning, and pre-training.
+Developed in partnership with Mosaic AI's research team, this cookbook lays out Databricks' best-practice development workflow for building high-quality RAG apps: *evaluation-driven development*. It outlines the most relevant knobs & approaches that can increase RAG application quality and provides a comprehensive repository of sample code implementing those techniques.
 
-## Retrieval-augmented generation (RAG)
+```{important}
+- Only have 10 minutes and want to see a demo of Mosaic AI Agent Framework & Quality Lab? Start [here](https://DBDEMO).
+- Want to hop into code and deploy a RAG POC using your data? Start [here](./nbs/6-implement-overview.md).
+- Don't have any data, but want to deploy a sample RAG application? Start here.
+```
 
-The RAG cookbook is divided into 2 sections:
-1. [**Learn:**](#learn) Understand the required components of a production-ready, high-quality RAG application
-2. [**Implement:**](#implement) Use our sample code to follow the Databricks-recommended developer workflow for delivering a high-quality RAG application
+```{image} images/index/dbxquality.png
+:align: center
+```
 
+<br/>
 
-#### Learn
 
-**Table of contents**
-1. [RAG overview](./nbs/1-introduction-to-rag): High level overview of the basic concepts of RAG
-2. [RAG fundamentals](./nbs/2-fundamentals-unstructured): Introduction to the key components of a RAG application
-3. [RAG quality knobs](./nbs/3-deep-dive): Explains the knobs that Databricks recommends tuning in order to improve RAG application quality
-4. [RAG quality evaluation deep dive](./nbs/4-evaluation): Understand how RAG evaluation works, including creating evaluation sets, the quality metrics that matter, and required developer tooling
-5. [RAG development workflow](nbs/5-rag-development-workflow.md): Understand Databricks recommended development workflow for building, testing, and deploying a high-quality RAG application: evaluation-driven development
+```{image} images/5-hands-on/review_app2.gif
+:align: center
+```
+
+<br/>
+
+This cookbook is intended for use with the Databricks platform. Specifically:
+- [Mosaic AI Agent Framework](https://docs.databricks.com/generative-ai/retrieval-augmented-generation.html), which provides a fast developer workflow with enterprise-ready LLMOps & governance
+- [Mosaic AI Quality Lab](https://docs.databricks.com/generative-ai/agent-evaluation/index.html), which provides reliable quality measurement using proprietary AI-assisted LLM judges, calibrated with human feedback collected through an intuitive web-based chat UI
+
+
+# Retrieval-augmented generation (RAG)
+
+> This first release focuses on retrieval-augmented generation (RAG). Future releases will cover other popular generative AI techniques: agents & function calling, prompt engineering, fine-tuning, and pre-training.
+
-**Getting started**
+The RAG cookbook is divided into 2 sections:
+1. [**Learn:**](#learn) Understand the required components of a production-ready, high-quality RAG application
+2. [**Implement:**](#implement) Use our sample code to follow an evaluation-driven workflow for delivering a high-quality RAG application
+
+## Code-based quick starts
 
 | Time required | Outcome | Link |
 |------ | ---- | ---- |
 | 🕧 <br/> 10 minutes | Sample RAG app deployed to a web-based chat app that collects feedback | [RAG Demo](https://DBDEMO) |
-| 🕧🕧🕧 <br/>60 minutes | POC RAG app with *your data* deployed to a chat UI that can collect feedback from your business stakeholders | [Build a POC](./nbs/5-hands-on-build-poc.md)|
-| 🕧🕧 <br/>30 minutes | Comprehensive quality/cost/latency evaluation of your POC app | [Evaluate your POC](./nbs/5-hands-on-evaluate-poc.md) |
+| 🕧🕧🕧 <br/> 60 minutes | POC RAG app with *your data* deployed to a chat UI that can collect feedback from your business stakeholders | [Build & deploy a POC](./nbs/5-hands-on-build-poc.md) |
+| 🕧🕧 <br/> 30 minutes | Comprehensive quality/cost/latency evaluation of your POC app | - [Evaluate your POC](./nbs/5-hands-on-evaluate-poc.md) <br/> - [Identify the root causes of quality issues](./nbs/5-hands-on-improve-quality-step-1.md) |
+
+
+
+## Table of contents
+<!--
+**Table of contents**
+1. [RAG overview](./nbs/1-introduction-to-rag): Understand how RAG works at a high-level
+2. [RAG fundamentals](./nbs/2-fundamentals-unstructured): Understand the key components in a RAG app
+3. [RAG quality knobs](./nbs/3-deep-dive): Understand the knobs Databricks recommends tuning to improve RAG app quality
+4. [RAG quality evaluation deep dive](./nbs/4-evaluation): Understand how RAG evaluation works, including creating evaluation sets, the quality metrics that matter, and required developer tooling
+5. [Evaluation-driven development](nbs/5-rag-development-workflow.md): Understand the Databricks-recommended development workflow for building, testing, and deploying a high-quality RAG application: evaluation-driven development-->
+
+```{tableofcontents}
+```
 <!--
 #### Implement
 **Table of contents**
@@ -48,12 +77,4 @@ The RAG cookbook is divided into 2 sections:
 3. [Evaluate POC’s Quality](./nbs/5-hands-on-evaluate-poc.md): Assess the quality of your POC to identify areas for improvement
 4. [Root Cause & Iteratively Fix Quality Issues](./nbs/5-hands-on-improve-quality.md): Diagnose the root causes of any quality issues and apply iterative fixes to improve the app's quality
 5. [Deploy & Monitor](./nbs/5-hands-on-deploy-and-monitor.md): Deploy the finalized RAG app to production and continuously monitor its performance to ensure sustained quality.
-
-**Getting started**
-
-
-| Time required | Outcome | Link |
-|------ | ---- | ---- |
-| 🕧 <br/> 5 minutes | Understand how RAG works at a high-level | [Intro to RAG](./nbs/1-introduction-to-rag.md) |
-| 🕧🕧 <br/> 30 minutes |Understand the key components in a RAG app | [RAG fundamentals](./nbs/2-fundamentals-unstructured.md) |
-| 🕧🕧🕧 <br/> 60 minutes | Understand the knobs Databricks recommends tuning improve RAG app quality | [RAG quality knobs](./nbs/3-deep-dive.md) |
+-->
2 changes: 1 addition & 1 deletion genai_cookbook/nbs/4-evaluation-eval-sets.md
@@ -8,7 +8,7 @@ A good evaluation set has the following characteristics:
 
 - **Representative:** Accurately reflects the variety of requests the application will encounter in production.
 - **Challenging:** The set should include difficult and diverse cases to effectively test the model's capabilities. Ideally, it will include adversarial examples, such as questions attempting prompt injection or questions attempting to elicit inappropriate responses from the LLM.
-- **Continually updated:** The set must be periodically updated to reflect how the application is used in production and the changing nature of the indexed data.
+- **Continually updated:** The set must be periodically updated to reflect how the application is used in production, the changing nature of the indexed data, and any changes to the application requirements.
 
 Databricks recommends at least 30 questions in your evaluation set, and ideally 100-200. The best evaluation sets will grow over time to contain thousands of questions.
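To make this concrete, one illustrative shape for an evaluation-set row is sketched below; the field names are assumptions for illustration, not a documented Quality Lab schema:

```python
# One row of a hypothetical evaluation set: a representative request plus
# ground truth for both retrieval and response quality.
eval_set = [
    {
        "request": "How do I enable change data feed on a Delta table?",
        "expected_response": "Set the table property "
                             "delta.enableChangeDataFeed = true on the table.",
        # Labeled source documents enable deterministic retrieval metrics
        # such as precision and recall.
        "expected_retrieved_context": [
            {"doc_uri": "docs/delta/change-data-feed.md"},
        ],
    },
    # ...grow toward 30-200 rows, including adversarial requests such as
    # prompt-injection attempts, per the characteristics above.
]
```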

6 changes: 4 additions & 2 deletions genai_cookbook/nbs/4-evaluation-metrics.md
@@ -4,7 +4,7 @@ With an evaluation set, you are able to measure the performance of your RAG application
 
 - **Retrieval quality**: Retrieval metrics assess how successfully your RAG application retrieves relevant supporting data. Precision and recall are two key retrieval metrics.
 - **Response quality**: Response quality metrics assess how well the RAG application responds to a user's request. Response metrics can measure, for instance, if the resulting answer is accurate per the ground truth, how well-grounded the response was given the retrieved context (e.g., did the LLM hallucinate), or how safe the response was (e.g., no toxicity).
-- **Cost & latency:** Chain metrics capture the overall cost and performance of RAG applications. Overall latency and token consumption are examples of chain performance metrics.
+- **System performance (cost & latency):** These metrics capture the overall cost and performance of RAG applications. Overall latency and token consumption are examples of chain performance metrics.
 
 It is very important to collect both response and retrieval metrics. A RAG application can respond poorly in spite of retrieving the correct context; it can also provide good responses on the basis of faulty retrievals. Only by measuring both components can we accurately diagnose and address issues in the application.
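As a concrete sketch of the deterministic retrieval metrics named above (assuming the evaluation set labels the documents that contain each answer), precision and recall at k can be computed per request:

```python
def precision_recall_at_k(retrieved_uris, expected_uris, k):
    """Deterministic precision@k and recall@k for a single request."""
    top_k = retrieved_uris[:k]
    hits = sum(1 for uri in top_k if uri in set(expected_uris))
    precision = hits / k
    recall = hits / len(expected_uris) if expected_uris else 0.0
    return precision, recall

# Example: 5 chunks retrieved, 1 of the 2 labeled documents found.
p, r = precision_recall_at_k(
    ["a.md", "b.md", "c.md", "d.md", "e.md"], ["a.md", "z.md"], k=5
)
print(p, r)  # 0.2 0.5
```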

@@ -13,7 +13,9 @@ There are two key approaches to measuring performance across these metrics:
 - **Deterministic measurement:** Cost and latency metrics can be computed deterministically based on the application's outputs. If your evaluation set includes a list of documents that contain the answer to a question, a subset of the retrieval metrics can also be computed deterministically.
 - **LLM-judge-based measurement:** In this approach, a separate [LLM acts as a judge](https://arxiv.org/abs/2306.05685) to evaluate the quality of the RAG application's retrieval and responses. Some LLM judges, such as answer correctness, compare the human-labeled ground truth against the app's outputs. Others, such as groundedness, do not require human-labeled ground truth to assess the app's outputs.
 
-Take time to ensure that the LLM judge's evaluations align with the RAG application's success criteria.
+```{important}
+For an LLM judge to be effective, it must be tuned to the use case. This requires careful attention to where the judge does and does not work well, followed by tuning the judge to improve it on those failure cases.
+```
 
 > [Mosaic AI Quality Lab](https://docs.databricks.com/generative-ai/agent-evaluation/index.html) provides an out-of-the-box implementation, using hosted LLM judge models, for each metric discussed on this page. Quality Lab's documentation discusses the [details](https://docs.databricks.com/generative-ai/agent-evaluation/llm-judge-metrics.html) of how these metrics and judges are implemented and provides [capabilities](https://docs.databricks.com/generative-ai/agent-evaluation/advanced-agent-eval.html#provide-examples-to-the-built-in-llm-judges) to tune the judges with your data to increase their accuracy.
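One hedged way to act on the tuning guidance above (a plain sketch, not a Quality Lab API): score a labeled sample with the judge, measure agreement against human labels, and inspect the disagreements to find the failure cases worth tuning on.

```python
def judge_agreement(rows):
    """rows: (request, human_verdict, judge_verdict) triples of booleans."""
    disagreements = [req for req, human, judge in rows if human != judge]
    agreement = 1 - len(disagreements) / len(rows)
    return agreement, disagreements

# Hypothetical sample of judge verdicts vs. human labels.
rows = [
    ("How do I reset my password?", True, True),
    ("Multi-hop pricing question", True, False),   # judge too strict
    ("Prompt-injection attempt", False, False),
]
rate, misses = judge_agreement(rows)
print(f"agreement={rate:.0%}", misses)  # agreement=67% ['Multi-hop pricing question']
```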