Merge pull request #25 from sandbox-ai/README
[MOD] New readme
Kr4us authored Apr 11, 2024
2 parents e2db235 + 1f2aa2b commit 4847ced
Showing 4 changed files with 17 additions and 13 deletions.
24 changes: 11 additions & 13 deletions README.md
@@ -17,7 +17,7 @@
</p>

This repo contains the code and instructions to set up ALI, a web UI and back-end pipeline for Retrieval Augmented Generation around legal documents.
You can try ALI at [https://chat.omnibus.com.ar/](https://chat.omnibus.com.ar/) .
You can try ALI [here](https://ali.sandboxai.ar/) .

## Table of Contents
- [Installation](#installation)
@@ -37,7 +37,7 @@ You can try ALI at [https://chat.omnibus.com.ar/](https://chat.omnibus.com.ar/)

### Installation

Clone repo, create new environment, install backend and frontend requirements
Clone the repo, create a new environment and install backend and frontend requirements:
```bash
git clone https://github.com/sandbox-ai/ali.git
cd ali
@@ -51,7 +51,7 @@ npm install

**Usage**

Launch the backend in one terminal
Launch the backend in one terminal:
```bash
conda activate ali
export OPENAI_API_KEY=<your-key-here>
@@ -67,32 +67,30 @@ cd frontend
ng serve
```

Also see `backend/README.md` and `frontend/README.md`
Also see [`backend/README.md`](backend/README.md) and [`frontend/README.md`](frontend/README.md)


## Description
### Overview

The complexity of legal documents and legalese terminology presents a barrier to most of the population, who wont't able to interact and understand their own legal system unless aided by a professional in the field.
The complexity of legal documents and legalese terminology presents a barrier to most of the population, who won't be able to interact with and understand their own legal system unless aided by a professional in the field.

To help close this gap in the Spanish-speaking world, we built the Asistente Legal Inteligente (or ALI). It uses Retrieval Augmented Generation ([RAG](https://arxiv.org/abs/2005.11401)), i.e. searching a vectorstore given a user query and formulating an answer with a Large Language Model (LLM), together with a custom dataset laid out in a RAG-optimized structure.

The result is a grounded and comprehensive assistant that can answer questions about general legislation and specific legal situations.

### Technical information
#### General
The user query is embedded using a custom Spanish [embedding model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn), then used to search for the best matching legal documents with cosine-similarity.

To formulate the answer, an LLM osted with a [OpenAI compatible API endpoint](https://platform.openai.com/docs/api-reference) is queried by passing a custom prompt and the best matching documents.
The user query is embedded using a custom Spanish [embedding model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn), and then used to search for the best-matching legal documents with cosine similarity. To formulate the answer, an LLM hosted with an [OpenAI-compatible API endpoint](https://platform.openai.com/docs/api-reference) is queried with a custom prompt and the relevant documents.

This technique has ample room for improvements. See our roadmap on [RAG improvements](#improvements-over-baseline-rag).
You can check out the RAG system written from scratch in `src/rag_session.py`
You can check out the RAG system written from scratch in [`src/rag_session.py`](src/rag_session.py)
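
As a rough illustration of the retrieval step described above (this is not the code in `src/rag_session.py`), the sketch below ranks documents by cosine similarity against a query vector. The vectors are toy values standing in for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
docs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.1, 0.0]

best = top_k(query, docs, k=2)
for idx, score in best:
    print(f"doc {idx}: similarity {score:.3f}")
```

In a full pipeline, the texts of the top-ranked documents would then be placed into the prompt sent to the LLM.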

#### On LLM frameworks
We've tested both [llama_index](https://github.com/run-llama/llama_index) and [langchain](https://github.com/langchain-ai/langchain), but found them too restrictive and, in the end, more cumbersome than developing our own pipeline over [transformers](https://huggingface.co/docs/transformers/en/index), which gives us finer control and supervision.

#### Main challenges encountered
The main problem we encountered with a RAG pipeline over argentinian legal data was the embedding of the information. This problem has two parts:
The main problem we encountered with a RAG pipeline over Argentinian legal data was the embedding of the information. This problem has two parts:

1. Embedding model:

@@ -114,7 +112,7 @@ instead of just
```


All the results that we found in relation to the embedding models were a direct conclusion of plotting the resulting embedding vectors with the dimensionality reduction technique [t-SNE: t-distributed Stochastic Neighbor Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Note that there are alternatives such as [UMAP: Universal Manifold Approximation & Projection](https://github.com/lmcinnes/umap)
All our findings about the embedding models came directly from plotting the resulting embedding vectors with the dimensionality reduction technique [t-SNE: t-distributed Stochastic Neighbor Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Note that there are alternatives such as [UMAP: Universal Manifold Approximation & Projection](https://github.com/lmcinnes/umap).

#### Improvements over baseline RAG.
There are many improvements to be made over this baseline RAG. The following is a non-exhaustive list:
@@ -137,7 +135,7 @@ We have built [a tool to scrap](https://github.com/sandbox-ai/Boletin-Oficial-Ar
This raw dataset must be parsed into the format described earlier (prepending contextual metadata). Given the inconsistent and unpredictable formatting of the documents and texts, there is no simple programmatic parsing to automate the process. We found that various NLP techniques are useful in automating this task (prompting LLMs, sentence-transformers, NER).

#### Testing
[Ragas](https://github.com/explodinggradients/ragas) is a framework to evaluate RAG pipelines that could be used to test ALI. It is important to acknowledge the costs using a paid API (we tested this with a relatively small document and GPT-4 and spent 40 USD in half an hour)
[Ragas](https://github.com/explodinggradients/ragas) is a framework to evaluate RAG pipelines that could be used to test ALI. It is important to acknowledge the cost of using a paid API (we tested this with a relatively small document and GPT-4 and spent 40 USD in half an hour!)

## Acknowledgements
Huge thanks to the [Justicio](https://github.com/bukosabino/justicio) team from Spain, who gave us a lot of tips and shared their embedding model with us
@@ -150,7 +148,7 @@ Definitely go check their project and talk to the creators!
4. Push to the branch: `git push origin my-new-feature`
5. Submit a pull request :D

Please refer to CONTRIBUTING.md for a detailed explanation of branch/commit naming conventions
Please refer to [CONTRIBUTING.md](CONTRIBUTING.md) for a detailed explanation of branch/commit naming conventions

## Who we are
We are a group of Argentinian developers named [`sandbox.ai`](https://sandbox-ai.github.io/).
2 changes: 2 additions & 0 deletions backend/Dockerfile
@@ -16,3 +16,5 @@ EXPOSE 5000
# default command
CMD ["python", "api.py", "-i", "0.0.0.0"]
# CMD ["tail", "-f", "/dev/null"]

#trigger
2 changes: 2 additions & 0 deletions frontend/Dockerfile
@@ -18,3 +18,5 @@ COPY --from=build-stage /app/dist/qafront/ /usr/share/nginx/html

# Copy nginx.conf
COPY nginx.conf /etc/nginx/conf.d/default.conf

#trigger
2 changes: 2 additions & 0 deletions nginx/docker-compose.yaml
@@ -8,3 +8,5 @@ services:
- ./:/etc/nginx/templates
extra_hosts:
- "host.docker.internal:host-gateway"

#trigger
