Finishing touches and cleanup

BU-Spark · Dec 12, 2024 · a9e136f
1 parent 2b516d5
commit a9e136f
Showing 5 changed files with 174 additions and 75 deletions.
diff --git a/EVALUATION.md b/EVALUATION.md
@@ -0,0 +1,98 @@
+# Analysis of RAG Responses
+
+This document analyzes the performance of a Retrieval-Augmented Generation (RAG) system on three categories of queries: specific queries, vague queries, and gibberish or unknown facts. A final section covers a query in another language (Italian). Each section below gives example queries, the responses, and observations from the test results.
+
+## Specific Queries
+
+### Examples
+
+- **Query:** Where is boston?  
+
+  **Response:** Boston is located in the state of Massachusetts, in Suffolk County, United States, which is part of North and Central America.  
+
+  **Sources:** While the response mentions Suffolk County, the sources contained it only as a keyword and did not explicitly state the connection.  
+
+- **Query:** Who was Hitler?  
+
+  **Response:** Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** The sources contained information about World War II but did not specifically address Hitler, even if the Boston Public Library (BPL) holds material about him.  
+
+- **Query:** Do you know Ken Follet?  
+
+  **Response:** Based on the provided context, there is no information available about Ken Follet.  
+
+  **Sources:** The sources retrieved were not related to Ken Follett. It appears that the Boston Public Library (BPL) does not have any material about him.  
+
+- **Query:** Do you know the book 1984?  
+
+  **Response:** Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** The retrieved sources were not related to 1984. The BPL database lacks content specific to this book.  
+
+### Notes
+The RAG system readily falls back to "Sorry, I was unable to find any documents relevant to your query" when it lacks specific information to generate a meaningful answer. When it does answer, the response is typically well constructed and accurate. For the RAG to work well, the documents it retrieves must themselves be specific and descriptive. For example, there may be numerous images of Hitler, but without descriptive text accompanying them the RAG cannot provide a comprehensive answer. Additionally, since we are working with the Digital Commonwealth rather than a traditional library catalog, it is understandable that there is little or no material on topics like Ken Follett or the book 1984.
+
+## Vague Queries
+
+### Examples
+- **Query:** Tell me about neurology.  
+
+  **Response:** Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** While the response indicates no relevant documents, the sources reportedly contained meaningful information about neurology.  
+
+- **Query:** Boston.  
+
+  **Response:** The context provides a comprehensive overview of Boston, highlighting its historical and geographical aspects. It includes a south-east view of the city, a view of the State Capitol, and a depiction of the regional open space system. It also provides historical maps of Boston, including one from the 18th century that shows the city's maritime economy and settlement patterns, and a re-issue of this map from 1835. The context also mentions various landmarks, such as the Ether Monument Public Garden, the site of Anne Hutchinson’s house, and the John Hancock house. The city's history is also touched upon, with references to its settlement by Europeans in 1630 and its previous habitation by Native Americans.  
+
+  **Sources:** The sources are highly relevant and align well with the query, providing detailed historical and geographical information about Boston.  
+
+- **Query:** What happened in 1984?  
+
+  **Response:** Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** All retrieved sources were editions of the same journal from 1984. While temporally related, the content did not address the question directly.  
+
+### Notes
+Vague queries pose an interesting challenge for the RAG system. It often responds with "Sorry, I was unable to find any documents relevant to your query" even when it retrieves interesting and relevant sources. It appears that only sufficiently descriptive sources enable the system to generate a proper answer; when it does answer, the response is typically accurate and on point. In addition, the database stores each edition of a journal as a separate entry, so the RAG can retrieve many editions of the same journal as sources, which is redundant. To improve the user experience, it would help to report only one edition of a journal in such cases.
+
+## Gibberish and Unknown Facts
+
+### Examples
+- **Query:** Do you know De Bruyne?  
+
+  **Response:** Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** As expected, the response indicates no knowledge of De Bruyne. The retrieved sources were about someone with a similar name.  
+
+- **Query:** Qwghkjhgf.  
+
+  **Response:** Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** The retrieved sources were random, as expected for a nonsensical query.  
+
+- **Query:** Hello, I really like dogs, can you give me some dogs please? Thank you!  
+
+  **Response:** Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** The retrieved sources were about dogs.  
+
+### Notes
+The system is behaving as expected: when no relevant answer can be generated, the sources retrieved are either random or associated based on similarity in names or words. For instance, a query about "de Bruyne" results in "Les femmes de Bruges," as these are the most similar terms in the database. (Since de Bruyne is a contemporary football player, it is entirely reasonable that the Digital Commonwealth does not contain any information about him.)
+
+## Query in a Different Language (Italian)
+
+### Example
+
+- **Query:** Ciao, dove si trova boston?  
+
+  **Response:** Boston si trova negli Stati Uniti, nello stato del Massachusetts. / Sorry, I was unable to find any documents relevant to your query.  
+
+  **Sources:** The sources are about Boston, but differ from those retrieved for the same English query / The sources are about Italy, but not related to Boston itself (e.g., Milan or Rome).  
+
+### Notes
+Working with another language makes it challenging to receive the same answer consistently. Sometimes, the system provides the correct response (identical to the English version but translated into Italian) and sometimes the default message: "Sorry, I was unable to find any documents relevant to your query." Additionally, the sources retrieved vary from case to case, and the accuracy of the answer seems to depend on the quality and relevance of these sources. It's interesting to see how an Italian query can correspond to sources about Italy and not about the query itself.
+
+## Final Disclaimer
+This test was conducted on a partial database. The inability of the RAG system to find specific information may be due to the absence of relevant data in the current product configuration, even though such information might exist in the complete database.
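The Vague Queries notes in EVALUATION.md suggest reporting only one edition of a journal. A minimal sketch of that idea follows, assuming each retrieved source is a dict with a `title` key and that editions of one journal differ only by digits and punctuation in the title. Both assumptions, and the names `normalize_title` and `dedupe_editions`, are hypothetical and not taken from the project code.

```python
import re
from typing import Dict, List


def normalize_title(title: str) -> str:
    """Lowercase a title and drop digits and punctuation so that
    editions of the same journal map to the same key."""
    no_digits = re.sub(r"\d+", " ", title.lower())
    letters_only = re.sub(r"[^a-z\s]", " ", no_digits)
    return " ".join(letters_only.split())


def dedupe_editions(sources: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first (highest-ranked) source for each normalized title.

    Sources without a usable title are always kept, since there is
    nothing to compare them on.
    """
    seen = set()
    unique = []
    for source in sources:
        key = normalize_title(source.get("title", ""))
        if key and key in seen:
            continue  # another edition of a journal we already kept
        if key:
            seen.add(key)
        unique.append(source)
    return unique
```

Because the input order is preserved, applying this after reranking keeps the best-ranked edition. Titles that differ in wording rather than numbering would need a fuzzier comparison than this sketch provides.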
diff --git a/RAG.py b/RAG.py
@@ -125,6 +125,9 @@ def RAG(llm: Any, query: str,vectorstore:PineconeVectorStore, top: int = 10, k:
     """Main RAG function with improved error handling and validation."""
     start = time.time()
     try:
+
+        # Query alignment is commented out; however, I have decided to leave it in for potential future use.
+
         # Retrieve initial documents using rephrased query -- not working as intended currently, maybe would be better for data with more words.
         # query_template = PromptTemplate.from_template(
         #     """

diff --git a/WRITEUP.md b/WRITEUP.md
@@ -187,7 +187,7 @@ Our implementation uses GPT-4o-mini from OpenAI, however you could fit in your o
 Also, make sure to replace the variable INDEX_NAME in streamlit-app.py with the name of your index.
 
 ```
-OPENAI_API_KEY = "<you-api-key>"
+OPENAI_API_KEY = "<your-api-key>"
 ```
 
 Once you do that, you're ready to run.
@@ -201,6 +201,13 @@ This will run the app on port 8501. When querying, please be patient, sometimes
 **On-going Challenges**
  - Vector Store: 1.3 million metadata objects and 147,000 full text docs resulted in a cumulative ~140GB of vectors when using all-MiniLM-L6-v2 and recursive character splitting on 1000 characters. This made locally hosted vectorstores cumbersome and impractical. Our solution was to migrate a portion of our metadata vectors to Pinecone and use that in our final implementation. Hosting on Pinecone can become expensive and adds another dimension of complexity to the project.
  - Speed: Currently, the app takes anywhere from 25-70 sec to generate a response. We have found that the most time-consuming part is our calls to the Digital Commonwealth API to retrieve the rest of the metadata for each object retrieved from Pinecone. We were unable to store an object's full metadata in Pinecone due to internal limits, so we hit the Digital Commonwealth API instead. On average, each API response takes about 1/4 of a second; across 100 sequential requests (roughly 25 seconds in total) that becomes cumbersome.
- - Query Alignment: The way queries are worded can have impact on the quality of retrieval. We attempted to implement a form of query alignment by using an llm to generate a sample response to the query, however we found it to be ineffective and detrimental. Further research should be done in this area to improve standardization of query alignment ot improve retrieval.
+ - Query Alignment: The way queries are worded can have an impact on the quality of retrieval. We attempted to implement a form of query alignment by using an LLM to generate a sample response to the query; however, we found it to be ineffective and detrimental. One specific aspect is the efficacy of specific queries versus vague ones ("Who wrote letters to WEB Du Bois?" vs "What happened in Boston in 1919?"). Queries as a whole may benefit from segmentation into the metadata fields they likely reference, in order to inform querying (for example, set up a separate vectorstore for each field and retrieve against each with the corresponding part of the query). Further research should be done in this area to standardize query alignment and improve retrieval.
  - Usage Monitoring: Real-time usage monitoring through the console logs is implemented; however, it would be beneficial to add persistent usage monitoring to generate insights into model performance and query wording for MLOps purposes.
-
+
+ **Ad Hoc Process/Recommendations**
+  - Our demo on Hugging Face (which will temporarily be hooked up to a group member's personal Pinecone index before being disconnected after submission) only covers retrieval over 600k entries from the Digital Commonwealth API. Each entry's title fields and abstract were embedded and loaded into the vectorstore. We first retrieve the top 100 vectors most related to the query (with the intent to reduce vectorstore size and retrieve only on topical relevance). Then, using each retrieved vector's source id, we fetch the metadata fields deemed relevant to queries (abstract, title, format, etc.) and rerank with BM25 on those fields (with the intent to prioritize entries on metadata like format and date). This was an effective way to put together a quick demo.
+  - The data is large, and storage grows substantially with vector size, especially if you chunk each entry heavily. It is our formal recommendation that you host your vectorstore on Pinecone or another service, both for efficient retrieval and initialization and because of the storage constraints of Hugging Face Spaces.
+  - As mentioned previously, a way to segment and analyze each query prior to retrieval could create a more reliable and accurate retriever. Also of note is our prompt engineering: we strongly suggest using XML tags and a parser for efficient Chain of Thought in order to minimize LLM calls.
+  - Currently we hit the Digital Commonwealth API sequentially for metadata once we retrieve the top 100 vectors, in order to perform reranking and add context to the prompt. This is really slow. We recommend that you either replace this method with another or parallelize your calls (we tried parallelization, but found that rate limiting was too severe). A solution might be to create a metadata database that is initialized once on startup and used for lookups, or to create proxies for API parallelization.
+
+  Thank You and Best of Luck!
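One middle ground between the sequential metadata calls and the unrestricted parallelization described in WRITEUP.md is a small worker pool with retries and exponential backoff. The sketch below is only an illustration under stated assumptions: `fetch_one` is a hypothetical caller-supplied function that fetches metadata for one source id (for example, a wrapper around a Digital Commonwealth API request), and the worker count and delays are placeholder values, not tuned against the real rate limits. Whether this stays under the rate limiting the team ran into is untested.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def fetch_with_retry(
    fetch_one: Callable[[str], dict],
    source_id: str,
    retries: int = 3,
    base_delay: float = 0.5,
) -> Optional[dict]:
    """Call fetch_one, backing off exponentially between failed attempts.

    Returns None if every attempt fails. A real implementation would
    likely catch only network errors rather than every Exception.
    """
    for attempt in range(retries):
        try:
            return fetch_one(source_id)
        except Exception:
            if attempt == retries - 1:
                return None
            time.sleep(base_delay * (2 ** attempt))
    return None


def fetch_metadata_bounded(
    source_ids: Iterable[str],
    fetch_one: Callable[[str], dict],
    max_workers: int = 4,
    retries: int = 3,
    base_delay: float = 0.5,
) -> Dict[str, Optional[dict]]:
    """Fetch metadata for many ids using at most max_workers threads."""
    ids = list(source_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda sid: fetch_with_retry(fetch_one, sid, retries, base_delay),
            ids,
        )
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(ids, results))
```

Keeping `max_workers` small trades some speed for fewer simultaneous requests. The startup-initialized metadata database suggested in the recommendations would avoid the per-query API calls altogether.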
diff --git a/dataset-documentation/DATASETDOC-fa24.md b/dataset-documentation/DATASETDOC-fa24.md
@@ -0,0 +1,59 @@
+***Project Information*** 
+
+* The project name is LibRAG (Retrieval Augmented Generation)
+* https://github.com/BU-Spark/ml-bpl-rag/tree/main   
+* [Google Drive](https://drive.google.com/drive/folders/12_tsVcUgwdfUdXalD67NOgUL3tGeI6ss?usp=sharing)
+* This project involved implementing natural language querying into the Digital Commonwealth project.
+* Client: Boston Public Library
+* Contact: Eben English 
+* Class: DS549
+
+***Dataset Information***
+
+* Our data is contained on the SCC at /projectnb/sparkgrp/ml-bpl-rag-data  
+  * /vectorstore/final_embeddings/metadata_index - faiss index for the metadata
+  * /vectorstore/final_embeddings/fulltext_index - faiss index for the OCR text
+  * /full_data/bpl_data.json - metadata
+  * /full_data/clean_ft.json - fulltext
+* We did not have formal datasets; instead, we used the Digital Commonwealth API and created embeddings from it. There is no need for a data dictionary beyond the [Digital Commonwealth API](https://github.com/boston-library/solr-core-conf/wiki/SolrDocument-field-reference:-public-API) field reference.
+* What keywords or tags would you attach to the data set?  
+  * Domain(s) of Application: Natural Language Processing, Library Science 
+  * Civic tech
+
+*The following questions pertain to the datasets you used in your project.*   
+*Motivation* 
+
+* We needed to create embeddings of the Digital Commonwealth's data in order to perform retrieval
+
+*Composition*
+
+* Each entry in the Digital Commonwealth API represents an object in their repository; formats vary.  
+* There were ~1.3 million total objects when we last checked, about 147,000 of which contain full text from OCR'd documents. 
+* Our data was a comprehensive snapshot at the time of collection; the API continues to be updated.
+* Each field from the API represents a metadata classification.   
+* Data is publicly accessible and non-confidential
+
+*Collection Process*
+
+* We collected data from an API endpoint.
+* No sampling was performed
+* This data was collected in October 2024
+
+*Preprocessing/cleaning/labeling* 
+
+* Very limited character correction was performed on the fulltext data.
+* No transformations were applied outside of embedding.
+* The raw data is saved in ml-bpl-rag-data/full_data/bpl_data.json (metadata) and ml-bpl-rag-data/full_data/clean_ft.json (fulltext).
+
+*Uses* 
+
+* Embedding for retrieval
+
+*Distribution*
+
+* This data is free for subsequent students on this project to access and use.
+
+*Maintenance* 
+
+* There is currently no system in place for cleanly updating the data, though our instructions in WRITEUP.md include a way to ingest your own data from the API and embed it.
+
diff --git a/load_script.py b/load_script.py
@@ -1,140 +1,72 @@
-
-
 import json
-
 import time
-
 import os
-
 import sys
-
 import requests
 
-
 def fetch_digital_commonwealth():
-
     start = time.time()
-
     BASE_URL = "https://www.digitalcommonwealth.org/search.json?search_field=all_fields&per_page=100&q="
-
     PAGE = sys.argv[1]
-
     END_PAGE = sys.argv[2]
-
     file_name = f"out{PAGE}_{END_PAGE}.json"
-
-    FINAL_PAGE = 13038
-
+    FINAL_PAGE = 13038  # Hardcoded from an old version to keep us from paging out of range; consider computing the final page dynamically.
     output = []
-
     file_path = f"./{file_name}"
-
     # file_path = './output.json'
-
     if os.path.exists(file_path):
-
         with open(file_path,'r') as file:
-
             output = json.load(file)
-
             if int(PAGE) < (len(output) + 1):
-
                 PAGE = len(output) + 1
-
 
-
     if int(PAGE) >= int(END_PAGE):
-
         return None
-
     print(f'Reading page {PAGE} up to page {END_PAGE}')
 
     retries = 0
 
     while True:
-
         try:
-
             response = requests.get(f"{BASE_URL}&page={PAGE}")
-
             response.raise_for_status()
-
             data = response.json()
-
-
-
+
             # Append current page data to the output list
-
             output.append(data)
-
 
-
             # Save the entire output to a JSON file after each iteration
-
             with open(file_path, 'w') as f:
-
                 json.dump(output, f)
 
 
-
-
-
             # check if theres a next page
-
             # print(len(response))
-
             if data['meta']['pages']['next_page']:
-
                 if data['meta']['pages']['next_page'] == int(END_PAGE):
-
                     print(f"Processed and saved page {PAGE}. Total pages saved: {len(output)}")
-
                     break
-
-                elif data['meta']['pages']['next_page'] == FINAL_PAGE:
-
+                elif data['meta']['pages']['next_page'] == FINAL_PAGE: # This is hardcoded from an old version
                     print(f"finished page {PAGE}")
-
                     PAGE = FINAL_PAGE
-
                 else:
-
                     print(f"finished page {PAGE}")
-
                     PAGE = data['meta']['pages']['next_page']
-
             else:
-
                 print(f"Processed and saved page {PAGE}. Total pages saved: {len(output)}")
-
                 break
-
 
-
             retries = 0
-
-            # Optional: Add a small delay to avoid overwhelming the API
-
-            # time.sleep(0.5)
-
+            # time.sleep(0.5)  # optional delay, left disabled; added out of concern about rate limiting
         except requests.exceptions.RequestException as e:
-
             print(f"An error occurred: {e}")
-
             print(f"Processed and saved page {PAGE}. Total pages saved: {len(output)}")
-
             retries += 1
-
             if retries >= 5:
-
                 break
 
     end = time.time()
-
     print(f"Timer: {end - start}")
-
     print(f"Finished processing all pages. Total pages saved: {len(output)}")
-
 if __name__ == "__main__":
-
     fetch_digital_commonwealth()
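The comment on `FINAL_PAGE` suggests adding logic to determine the final page. One way to avoid the hardcoded value is to rely only on the `next_page` field the script already reads from `data['meta']['pages']` and stop when it is empty. The sketch below assumes a hypothetical `fetch_page` callable that returns the parsed JSON for one page; the name `iter_pages` is likewise illustrative and not part of load_script.py. As in the script, `end_page` is treated as exclusive.

```python
from typing import Callable, Iterator, Optional


def iter_pages(
    fetch_page: Callable[[int], dict],
    start_page: int,
    end_page: Optional[int] = None,
) -> Iterator[dict]:
    """Yield page payloads starting at start_page.

    Follows meta.pages.next_page until the API reports no next page
    or the next page would reach end_page (exclusive).
    """
    page = start_page
    while page and (end_page is None or page < end_page):
        data = fetch_page(page)
        yield data
        # next_page is falsy (e.g. None) on the last page, which ends the loop.
        page = data["meta"]["pages"]["next_page"]
```

In the script's context, `fetch_page` could wrap the existing `requests.get(f"{BASE_URL}&page={PAGE}")` call followed by `raise_for_status()` and `.json()`, leaving the retry and save-to-file logic in the caller.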