Enable milvus for retriever and data-preparation #858

Status: Open. Wants to merge 2 commits into base: main.
112 changes: 35 additions & 77 deletions comps/dataprep/milvus/langchain/README.md
@@ -1,10 +1,14 @@
# Dataprep Microservice with Milvus

You can start the dataprep microservice either with a Python script or with Docker Compose.

## 🚀1. Start Microservice with Python (Option 1)

### 1.1 Requirements

```bash
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps/comps/dataprep/milvus/langchain
pip install -r requirements.txt
apt-get install tesseract-ocr -y
apt-get install libtesseract-dev -y
@@ -18,39 +22,11 @@ Please refer to this [readme](../../../vectorstores/milvus/README.md).
### 1.3 Setup Environment Variables

```bash
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export MILVUS_HOST=${your_milvus_host_ip}
export MILVUS_PORT=19530
export COLLECTION_NAME=${your_collection_name}
export MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint}
```
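The dataprep service composes its Milvus connection URI from `MILVUS_HOST` and `MILVUS_PORT`. A quick sketch with placeholder values (substitute your own host IP):

```shell
# Placeholder values for illustration; use your deployment's host IP.
export MILVUS_HOST=localhost
export MILVUS_PORT=19530
# The service connects to Milvus at this URI:
MILVUS_URI="http://${MILVUS_HOST}:${MILVUS_PORT}"
echo "$MILVUS_URI"
```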

### 1.4 Start Mosec Embedding Service

First, you need to build a mosec embedding serving docker image.

```bash
cd ../../..
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-mosec-endpoint:latest -f comps/embeddings/mosec/langchain/dependency/Dockerfile .
export HUGGINGFACEHUB_API_TOKEN="YOUR_HUGGINGFACEHUB_API_TOKEN"
source set_env.sh
```

Then start the mosec embedding server.

```bash
your_port=6010
docker run -d --name="embedding-mosec-endpoint" -p $your_port:8000 opea/embedding-mosec-endpoint:latest
```

Setup environment variables:

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS_HOST=${your_host_ip}
```

### 1.5 Start Document Preparation Microservice for Milvus with Python Script
### 1.4 Start Document Preparation Microservice for Milvus with Python Script

Start the document preparation microservice for Milvus with the command below.

@@ -60,43 +36,23 @@ python prepare_doc_milvus.py

## 🚀2. Start Microservice with Docker (Option 2)

### 2.1 Start Milvus Server

Please refer to this [readme](../../../vectorstores/milvus/README.md).

### 2.2 Build Docker Image
### 2.1 Build Docker Image

```bash
cd ../../..
# build mosec embedding docker image
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-langchain-mosec-endpoint:latest -f comps/embeddings/mosec/langchain/dependency/Dockerfile .
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps/comps/dataprep/milvus/langchain
# build dataprep milvus docker image
docker build -t opea/dataprep-milvus:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg no_proxy=$no_proxy -f comps/dataprep/milvus/langchain/Dockerfile .
```

### 2.3 Setup Environment Variables

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS_HOST=${your_host_ip}
docker compose build --no-cache
```

### 2.3 Run Docker with CLI (Option A)
### 2.2 Run with Docker Compose (Option B)

```bash
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS_HOST=${MILVUS_HOST} opea/dataprep-milvus:latest
```

### 2.4 Run with Docker Compose (Option B)

```bash
mkdir model
cd model
git clone https://huggingface.co/BAAI/bge-base-en-v1.5
cd ../
# Update `host_ip` and `HUGGINGFACEHUB_API_TOKEN` in set_env.sh
. set_env.sh
docker compose -f docker-compose-dataprep-milvus.yaml up -d
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps/comps/dataprep/milvus/langchain
export HUGGINGFACEHUB_API_TOKEN="YOUR_HUGGINGFACEHUB_API_TOKEN"
source set_env.sh
docker compose -f docker-compose.yaml up -d
```

## 🚀3. Consume Microservice
@@ -105,15 +61,19 @@ docker compose -f docker-compose-dataprep-milvus.yaml up -d

Once the document preparation microservice for Milvus is started, users can invoke it with the commands below to convert documents to embeddings and save them to the database.

Make sure the file path after `files=@` is correct. To update the knowledge base with the local file `nke-10k-2023.pdf`, first download it in a terminal:

```bash
wget https://raw.githubusercontent.com/opea-project/GenAIComps/main/comps/retrievers/redis/data/nke-10k-2023.pdf
```

- Single file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file.pdf" \
http://localhost:6010/v1/dataprep
-F "files=@./nke-10k-2023.pdf" \
http://localhost:6007/v1/dataprep
```
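The same upload can be driven from Python. The sketch below builds the `multipart/form-data` body by hand using only the standard library; the endpoint URL and the in-memory PDF bytes are placeholders, and actually sending the request requires the running service:

```python
import uuid

def build_multipart(field: str, filename: str, payload: bytes, content_type: str):
    """Build a multipart/form-data body equivalent to curl -F "files=@...".

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

body, ctype = build_multipart("files", "nke-10k-2023.pdf", b"%PDF-1.4 placeholder", "application/pdf")
# To send (the dataprep service must be running):
# import urllib.request
# req = urllib.request.Request("http://localhost:6007/v1/dataprep", data=body,
#                              headers={"Content-Type": ctype}, method="POST")
# urllib.request.urlopen(req)
```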

You can specify `chunk_size` and `chunk_overlap` with the following commands. To avoid overly large chunks, pass a small `chunk_size` such as 500 (default 1500).
@@ -124,7 +84,7 @@ curl -X POST \
-F "files=@./file.pdf" \
-F "chunk_size=500" \
-F "chunk_overlap=100" \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```
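The effect of `chunk_size` and `chunk_overlap` can be illustrated with a minimal fixed-size splitter (an illustration only, not the service's actual splitter, which also respects separators):

```python
def split_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 100):
    """Minimal fixed-size splitter: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(chr(65 + i % 26) for i in range(1200))  # 1200-char dummy document
chunks = split_text(doc, chunk_size=500, chunk_overlap=100)
print(len(chunks))  # chunks start at offsets 0, 400, 800
```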

- Multiple file upload
@@ -135,15 +95,13 @@ curl -X POST \
-F "files=@./file1.pdf" \
-F "files=@./file2.pdf" \
-F "files=@./file3.pdf" \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```

- Links upload (not currently supported for llama_index)

```bash
curl -X POST \
-F 'link_list=["https://www.ces.tech/"]' \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```

or
@@ -153,7 +111,7 @@ import requests
import json

proxies = {"http": ""}
url = "http://localhost:6010/v1/dataprep"
url = "http://localhost:6007/v1/dataprep"
urls = [
"https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
@@ -173,15 +131,15 @@ We support table extraction from pdf documents. You can specify process_table an
Note: if you specify `table_strategy=llm`, you should first start a TGI service (refer to sections 1.2.1 and 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md), then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.

```bash
curl -X POST -H "Content-Type: application/json" -d '{"path":"/home/user/doc/your_document_name","process_table":true,"table_strategy":"hq"}' http://localhost:6010/v1/dataprep
curl -X POST -H "Content-Type: application/json" -d '{"path":"/home/user/doc/your_document_name","process_table":true,"table_strategy":"hq"}' http://localhost:6007/v1/dataprep
```
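The three documented strategies can be summarized in a small lookup table (the descriptions paraphrase this README; the validation helper is illustrative, not part of the service):

```python
# Documented table_strategy values and their trade-offs (paraphrased from the README).
TABLE_STRATEGIES = {
    "fast": "default; fastest, shallowest table understanding",
    "hq": "slower; higher-quality table parsing",
    "llm": "slowest; deepest understanding, requires a running TGI service",
}

def describe_table_strategy(strategy: str = "fast") -> str:
    if strategy not in TABLE_STRATEGIES:
        raise ValueError(f"unknown table_strategy: {strategy!r}")
    return TABLE_STRATEGIES[strategy]

print(describe_table_strategy("hq"))
```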


### 3.2 Consume get_file API
@@ -191,7 +149,7 @@ To get uploaded file structures, use the following command:
```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6010/v1/dataprep/get_file
http://localhost:6007/v1/dataprep/get_file
```

Then you will get the response JSON like this:
@@ -224,19 +182,19 @@ The `file_path` here should be the `id` returned by the `/v1/dataprep/get_file` API.
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "https://www.ces.tech/.txt"}' \
http://localhost:6010/v1/dataprep/delete_file
http://localhost:6007/v1/dataprep/delete_file

# delete file
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "uploaded_file_1.txt"}' \
http://localhost:6010/v1/dataprep/delete_file
http://localhost:6007/v1/dataprep/delete_file

# delete all files and links, will drop the entire db collection
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "all"}' \
http://localhost:6010/v1/dataprep/delete_file
http://localhost:6007/v1/dataprep/delete_file
```
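The three delete modes above can be sketched as a dispatch on `file_path` (assumed behavior inferred from the examples; the actual endpoint logic lives in `prepare_doc_milvus.py`):

```python
def classify_delete_target(file_path: str) -> str:
    """Illustrative mapping of file_path values to delete actions,
    inferred from the README examples (not the service's actual code)."""
    if file_path == "all":
        return "drop_collection"   # removes every file and link
    if file_path.startswith(("http://", "https://")):
        return "delete_link"       # links are addressed as "<url>.txt"
    return "delete_file"           # a previously uploaded file name

print(classify_delete_target("uploaded_file_1.txt"))
```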

## 🚀4. Troubleshooting
@@ -248,5 +206,5 @@ curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file.pdf" \
-F "chunk_size=500" \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```
47 changes: 41 additions & 6 deletions comps/dataprep/milvus/langchain/docker-compose.yml
@@ -1,8 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: '3.5'

services:
etcd:
container_name: milvus-etcd
@@ -13,7 +11,7 @@ services:
- ETCD_QUOTA_BACKEND_BYTES=4294967296
- ETCD_SNAPSHOT_COUNT=50000
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
- ./volumes/etcd:/etcd
command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
healthcheck:
test: ["CMD", "etcdctl", "endpoint", "health"]
@@ -31,7 +29,7 @@ services:
- "5044:9001"
- "5043:9000"
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
- ./volumes/minio:/minio_data
command: minio server /minio_data --console-address ":9001"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
@@ -49,8 +47,8 @@ services:
ETCD_ENDPOINTS: etcd:2379
MINIO_ADDRESS: minio:9000
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
- ${DOCKER_VOLUME_DIRECTORY:-.}/milvus.yaml:/milvus/configs/milvus.yaml
- ./volumes/milvus:/var/lib/milvus
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
interval: 30s
@@ -64,6 +62,43 @@ services:
- "etcd"
- "minio"

dataprep-milvus-service:
image: ${REGISTRY:-opea}/dataprep-milvus:${TAG:-latest}
build:
context: ../../../..
dockerfile: ./comps/dataprep/milvus/langchain/Dockerfile
args:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
container_name: test-comps-dataprep-milvus-server
ports:
- "6007:6007"
depends_on:
- standalone
- tei-embedding-service
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
MILVUS_HOST: ${MILVUS_HOST}
MILVUS_PORT: ${MILVUS_PORT}
TEI_EMBEDDING_ENDPOINT: http://tei-embedding-service:80
HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}

tei-embedding-service:
image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
container_name: tei-embedding-server
ports:
- "6006:80"
volumes:
- "./data:/data"
shm_size: 1g
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
command: --model-id ${EMBEDDING_MODEL_ID} --auto-truncate

networks:
default:
name: milvus
driver: bridge
18 changes: 11 additions & 7 deletions comps/dataprep/milvus/langchain/prepare_doc_milvus.py
@@ -88,11 +88,12 @@ def ingest_chunks_to_milvus(file_name: str, chunks: List):
batch_docs = insert_docs[i : i + batch_size]

try:
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
_ = Milvus.from_documents(
batch_docs,
embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
partition_key_field=partition_field_name,
)
except Exception as e:
@@ -190,7 +191,7 @@ def delete_by_partition_field(my_milvus, partition_field):
logger.info(f"[ delete partition ] delete success: {res}")


@register_microservice(name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep", host="0.0.0.0", port=6010)
@register_microservice(name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep", host="0.0.0.0", port=6007)
async def ingest_documents(
files: Optional[Union[UploadFile, List[UploadFile]]] = File(None),
link_list: Optional[str] = Form(None),
@@ -207,10 +208,11 @@ raise HTTPException(status_code=400, detail="Provide either a file or a string list, not both.")
raise HTTPException(status_code=400, detail="Provide either a file or a string list, not both.")

# define Milvus obj
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
my_milvus = Milvus(
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
index_params=index_params,
auto_id=True,
)
@@ -336,17 +338,18 @@ async def ingest_documents(


@register_microservice(
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/get_file", host="0.0.0.0", port=6010
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/get_file", host="0.0.0.0", port=6007
)
async def rag_get_file_structure():
if logflag:
logger.info("[ get ] start to get file structure")

# define Milvus obj
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
my_milvus = Milvus(
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
index_params=index_params,
auto_id=True,
)
@@ -388,7 +391,7 @@ async def rag_get_file_structure():


@register_microservice(
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/delete_file", host="0.0.0.0", port=6010
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/delete_file", host="0.0.0.0", port=6007
)
async def delete_single_file(file_path: str = Body(..., embed=True)):
"""Delete file according to `file_path`.
@@ -401,10 +404,11 @@ async def delete_single_file(file_path: str = Body(..., embed=True)):
logger.info(file_path)

# define Milvus obj
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
my_milvus = Milvus(
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
index_params=index_params,
auto_id=True,
)