Merge pull request #25 from sandbox-ai/README

[MOD] New readme
sandbox-ai · Apr 11, 2024 · 4847ced
2 parents e2db235 + 1f2aa2b
commit 4847ced

Showing 4 changed files with 17 additions and 13 deletions.
diff --git a/README.md b/README.md
@@ -17,7 +17,7 @@
 </p>
 
 This repo contains the code and instructions to set up ALI, a web UI and back-end pipeline for Retrieval Augmented Generation around legal documents.
-You can try ALI at [https://chat.omnibus.com.ar/](https://chat.omnibus.com.ar/) .
+You can try ALI [here](https://ali.sandboxai.ar/) .
 
 ## Table of Contents
 - [Installation](#installation)
@@ -37,7 +37,7 @@ You can try ALI at [https://chat.omnibus.com.ar/](https://chat.omnibus.com.ar/)
 
 ### Installation
 
-Clone repo, create new environment, install backend and frontend requirements
+Clone the repo, create a new environment and install backend and frontend requirements:
 ```bash
 git clone https://github.com/sandbox-ai/ali.git
 cd ali
@@ -51,7 +51,7 @@ npm install
 
 **Usage**
 
-Launch the backend in one terminal
+Launch the backend in one terminal:
 ```bash
 conda activate ali
 export OPENAI_API_KEY=<your-key-here>
@@ -67,32 +67,30 @@ cd frontend
 ng serve
 ```
 
-Also see `backend/README.md` and `frontend/README.md`    
+Also see [`backend/README.md`](backend/README.md) and [`frontend/README.md`](frontend/README.md)    
 
 
 ## Description
 ### Overview
 
-The complexity of legal documents and legalese terminology presents a barrier to most of the population, who wont't able to interact and understand their own legal system unless aided by a professional in the field.
+The complexity of legal documents and legalese terminology presents a barrier to most of the population, who won't be able to interact with and understand their own legal system unless aided by a professional in the field.
 
 To help close this gap in the Spanish-speaking world, we built the Asistente Legal Inteligente (or ALI). It uses Retrieval Augmented Generation ([RAG](https://arxiv.org/abs/2005.11401)), i.e. searching a vectorstore given a user query and formulating an answer with a Large Language Model (LLM), and a custom dataset that provides a RAG-optimized structure. 
 
 The result is a grounded and comprehensive assistant that can answer questions about general legislation and specific legal situations.
 
 ### Technical information
 #### General
-The user query is embedded using a custom Spanish [embedding model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn), then used to search for the best matching legal documents with cosine-similarity.
-
-To formulate the answer, an LLM osted with a [OpenAI compatible API endpoint](https://platform.openai.com/docs/api-reference) is queried by passing a custom prompt and the best matching documents. 
+The user query is embedded using a custom Spanish [embedding model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn), and then used to search for the best matching legal documents with cosine-similarity. To formulate the answer, an LLM hosted with an [OpenAI compatible API endpoint](https://platform.openai.com/docs/api-reference) is queried with a custom prompt and the relevant documents. 
 
 This technique has ample room for improvements. See our roadmap on [RAG improvements](#improvements-over-baseline-rag). 
-You can check out the RAG system written from scratch in `src/rag_session.py`   
+You can check out the RAG system written from scratch in [`src/rag_session.py`](src/rag_session.py)   
 
 #### On LLM frameworks
 We've tested both [llama_index](https://github.com/run-llama/llama_index) and [langchain](https://github.com/langchain-ai/langchain), but found them too restrictive and in the end more cumbersome than developing our own pipeline over [transformers](https://huggingface.co/docs/transformers/en/index), enabling finer control and supervision. 
 
 #### Main challenges encountered
-The main problem we encountered with a RAG pipeline over argentinian legal data was the embedding of the information. This problem has two parts: 
+The main problem we encountered with a RAG pipeline over Argentinian legal data was the embedding of the information. This problem has two parts: 
 
 1. Embedding model:
 
@@ -114,7 +112,7 @@ instead of just
 ``` 
 
 
-All the results that we found in relation to the embedding models were a direct conclusion of plotting the resulting embedding vectors with the dimensionality reduction technique [t-SNE: t-distributed Stochastic Neighbor Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Note that there are alternatives such as [UMAP: Universal Manifold Approximation & Projection](https://github.com/lmcinnes/umap)
+All the results we found in relation to the embedding models were a direct conclusion of plotting the resulting embedding vectors with the dimensionality reduction technique [t-SNE: t-distributed Stochastic Neighbor Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Note that there are alternatives such as [UMAP: Universal Manifold Approximation & Projection](https://github.com/lmcinnes/umap)
 
 #### Improvements over baseline RAG. 
 There are many improvements to be made over this baseline RAG. The following is a non-exhaustive list: 
@@ -137,7 +135,7 @@ We have built [a tool to scrap](https://github.com/sandbox-ai/Boletin-Oficial-Ar
 This raw dataset must be parsed into the format described earlier (prepending contextual metadata). Given the inconsistent and unpredictable formatting of the documents and texts, there is no simple programmatic parsing to automate the process. We found that various NLP techniques are useful in automating this task (prompting LLMs, sentence-transformers, NER). 
 
 #### Testing
-[Ragas](https://github.com/explodinggradients/ragas) is a framework to evaluate RAG pipelines that could be used to test ALI. It is important to acknowledge the costs using a paid API (we tested this with a relatively small document and GPT-4 and spent 40 USD in half an hour)    
+[Ragas](https://github.com/explodinggradients/ragas) is a framework to evaluate RAG pipelines that could be used to test ALI. It is important to acknowledge the costs using a paid API (we tested this with a relatively small document and GPT-4 and spent 40 USD in half an hour!)    
 
 ## Acknowledgements
 Huge thanks to the [Justicio](https://github.com/bukosabino/justicio) team from Spain, who gave us a lot of tips and shared their embedding model with us
@@ -150,7 +148,7 @@ Definetly go check their project and talk to the creators!
 4. Push to the branch: `git push origin my-new-feature`
 5. Submit a pull request :D
 
-Please refer to CONTRIBUTING.md for a detailed explanation of branch/commit naming conventions
+Please refer to [CONTRIBUTING.md](CONTRIBUTING.md) for a detailed explanation of branch/commit naming conventions
 
 ## Who we are
 We are a group of Argentinian developers named [`sandbox.ai`](https://sandbox-ai.github.io/).
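
Note on the README text in this diff: the "General" section describes retrieval as embedding the user query and ranking legal documents by cosine similarity. The sketch below illustrates only that ranking step, using plain Python lists in place of real embeddings. The function names `cosine_similarity` and `top_k` and the toy vectors are assumptions made for illustration; they are not taken from `src/rag_session.py`, and the actual pipeline may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Rank documents by similarity to the query; return (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" (hypothetical; real models output far more dimensions).
    documents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
    query = [1.0, 0.0, 0.0]
    for index, score in top_k(query, documents, k=2):
        print(f"document {index}: similarity {score:.4f}")
```

In a full pipeline, the texts at the returned indices would be passed, together with a prompt, to the LLM endpoint that the README mentions.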

diff --git a/backend/Dockerfile b/backend/Dockerfile
@@ -16,3 +16,5 @@ EXPOSE 5000
 # default command
 CMD ["python", "api.py", "-i", "0.0.0.0"]
 # CMD ["tail", "-f", "/dev/null"]
+
+#trigger
diff --git a/frontend/Dockerfile b/frontend/Dockerfile
@@ -18,3 +18,5 @@ COPY --from=build-stage /app/dist/qafront/ /usr/share/nginx/html
 
 # Copy nginx.conf
 COPY nginx.conf /etc/nginx/conf.d/default.conf
+
+#trigger
diff --git a/nginx/docker-compose.yaml b/nginx/docker-compose.yaml
@@ -8,3 +8,5 @@ services:
       - ./:/etc/nginx/templates 
     extra_hosts:
       - "host.docker.internal:host-gateway"
+
+#trigger