Enable milvus for retriever and data-preparation #858

Status: Open. Wants to merge 2 commits into base: main.
112 changes: 35 additions & 77 deletions comps/dataprep/milvus/langchain/README.md
@@ -1,10 +1,14 @@
# Dataprep Microservice with Milvus

You can start the dataprep microservice either with a Python script or with Docker Compose.

## 🚀1. Start Microservice with Python (Option 1)

### 1.1 Requirements

```bash
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps/comps/dataprep/milvus/langchain
pip install -r requirements.txt
apt-get install tesseract-ocr -y
apt-get install libtesseract-dev -y
@@ -18,39 +22,11 @@ Please refer to this [readme](../../../vectorstores/milvus/README.md).
### 1.3 Setup Environment Variables

```bash
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export MILVUS_HOST=${your_milvus_host_ip}
export MILVUS_PORT=19530
export COLLECTION_NAME=${your_collection_name}
export MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint}
```
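The dataprep service composes its Milvus connection URI from `MILVUS_HOST` and `MILVUS_PORT`. A quick sketch with placeholder values (substitute your own host IP):

```shell
# Placeholder values for illustration; use your deployment's host IP.
export MILVUS_HOST=localhost
export MILVUS_PORT=19530
# The service connects to Milvus at this URI:
MILVUS_URI="http://${MILVUS_HOST}:${MILVUS_PORT}"
echo "$MILVUS_URI"
```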

### 1.4 Start Mosec Embedding Service

First, you need to build a mosec embedding serving docker image.

```bash
cd ../../..
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-mosec-endpoint:latest -f comps/embeddings/mosec/langchain/dependency/Dockerfile .
export HUGGINGFACEHUB_API_TOKEN="YOUR_HUGGINGFACEHUB_API_TOKEN"
source set_env.sh
```

Then start the mosec embedding server.

```bash
your_port=6010
docker run -d --name="embedding-mosec-endpoint" -p $your_port:8000 opea/embedding-mosec-endpoint:latest
```

Setup environment variables:

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS_HOST=${your_host_ip}
```

### 1.5 Start Document Preparation Microservice for Milvus with Python Script
### 1.4 Start Document Preparation Microservice for Milvus with Python Script

Start the document preparation microservice for Milvus with the command below.

@@ -60,43 +36,23 @@ python prepare_doc_milvus.py

## 🚀2. Start Microservice with Docker (Option 2)

### 2.1 Start Milvus Server

Please refer to this [readme](../../../vectorstores/milvus/README.md).

### 2.2 Build Docker Image
### 2.1 Build Docker Image

```bash
cd ../../..
# build mosec embedding docker image
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-langchain-mosec-endpoint:latest -f comps/embeddings/mosec/langchain/dependency/Dockerfile .
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps/comps/dataprep/milvus/langchain
# build dataprep milvus docker image
docker build -t opea/dataprep-milvus:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg no_proxy=$no_proxy -f comps/dataprep/milvus/langchain/Dockerfile .
```

### 2.3 Setup Environment Variables

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS_HOST=${your_host_ip}
docker compose build --no-cache
```

### 2.3 Run Docker with CLI (Option A)
### 2.2 Run with Docker Compose (Option B)

```bash
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS_HOST=${MILVUS_HOST} opea/dataprep-milvus:latest
```

### 2.4 Run with Docker Compose (Option B)

```bash
mkdir model
cd model
git clone https://huggingface.co/BAAI/bge-base-en-v1.5
cd ../
# Update `host_ip` and `HUGGINGFACEHUB_API_TOKEN` in set_env.sh
. set_env.sh
docker compose -f docker-compose-dataprep-milvus.yaml up -d
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps/comps/dataprep/milvus/langchain
export HUGGINGFACEHUB_API_TOKEN="YOUR_HUGGINGFACEHUB_API_TOKEN"
source set_env.sh
docker compose -f docker-compose.yaml up -d
```

## 🚀3. Consume Microservice
@@ -105,15 +61,19 @@ docker compose -f docker-compose-dataprep-milvus.yaml up -d

Once the document preparation microservice for Milvus is started, users can invoke it with the commands below to convert documents to embeddings and save them to the database.

Make sure the file path after `files=@` is correct. To update the knowledge base with the local file `nke-10k-2023.pdf`, first download it in a terminal:

```bash
wget https://raw.githubusercontent.com/opea-project/GenAIComps/main/comps/retrievers/redis/data/nke-10k-2023.pdf
```

- Single file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file.pdf" \
http://localhost:6010/v1/dataprep
-F "files=@./nke-10k-2023.pdf" \
http://localhost:6007/v1/dataprep
```
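The same upload can be driven from Python. The sketch below builds the `multipart/form-data` body by hand using only the standard library; the endpoint URL and the in-memory PDF bytes are placeholders, and actually sending the request requires the running service:

```python
import uuid

def build_multipart(field: str, filename: str, payload: bytes, content_type: str):
    """Build a multipart/form-data body equivalent to curl -F "files=@...".

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

body, ctype = build_multipart("files", "nke-10k-2023.pdf", b"%PDF-1.4 placeholder", "application/pdf")
# To send (the dataprep service must be running):
# import urllib.request
# req = urllib.request.Request("http://localhost:6007/v1/dataprep", data=body,
#                              headers={"Content-Type": ctype}, method="POST")
# urllib.request.urlopen(req)
```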

You can specify `chunk_size` and `chunk_overlap` with the following commands. To avoid overly large chunks, pass a small `chunk_size` such as 500 (default 1500).
@@ -124,7 +84,7 @@ curl -X POST \
-F "files=@./file.pdf" \
-F "chunk_size=500" \
-F "chunk_overlap=100" \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```
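The effect of `chunk_size` and `chunk_overlap` can be illustrated with a minimal fixed-size splitter (an illustration only, not the service's actual splitter, which also respects separators):

```python
def split_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 100):
    """Minimal fixed-size splitter: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(chr(65 + i % 26) for i in range(1200))  # 1200-char dummy document
chunks = split_text(doc, chunk_size=500, chunk_overlap=100)
print(len(chunks))  # chunks start at offsets 0, 400, 800
```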

- Multiple file upload
@@ -135,15 +95,13 @@ curl -X POST \
-F "files=@./file1.pdf" \
-F "files=@./file2.pdf" \
-F "files=@./file3.pdf" \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```

- Links upload (not currently supported for llama_index)

```bash
curl -X POST \
-F 'link_list=["https://www.ces.tech/"]' \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```

or
@@ -153,7 +111,7 @@ import requests
import json

proxies = {"http": ""}
url = "http://localhost:6010/v1/dataprep"
url = "http://localhost:6007/v1/dataprep"
urls = [
"https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
@@ -173,15 +131,15 @@ We support table extraction from pdf documents. You can specify process_table an
Note: if you specify `table_strategy=llm`, you should first start a TGI service (refer to sections 1.2.1 and 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md), then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.

```bash
curl -X POST -H "Content-Type: application/json" -d '{"path":"/home/user/doc/your_document_name","process_table":true,"table_strategy":"hq"}' http://localhost:6010/v1/dataprep
curl -X POST -H "Content-Type: application/json" -d '{"path":"/home/user/doc/your_document_name","process_table":true,"table_strategy":"hq"}' http://localhost:6007/v1/dataprep
```
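The three documented strategies can be summarized in a small lookup table (the descriptions paraphrase this README; the validation helper is illustrative, not part of the service):

```python
# Documented table_strategy values and their trade-offs (paraphrased from the README).
TABLE_STRATEGIES = {
    "fast": "default; fastest, shallowest table understanding",
    "hq": "slower; higher-quality table parsing",
    "llm": "slowest; deepest understanding, requires a running TGI service",
}

def describe_table_strategy(strategy: str = "fast") -> str:
    if strategy not in TABLE_STRATEGIES:
        raise ValueError(f"unknown table_strategy: {strategy!r}")
    return TABLE_STRATEGIES[strategy]

print(describe_table_strategy("hq"))
```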


### 3.2 Consume get_file API
@@ -191,7 +149,7 @@ To get uploaded file structures, use the following command:
```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6010/v1/dataprep/get_file
http://localhost:6007/v1/dataprep/get_file
```

Then you will get the response JSON like this:
@@ -224,19 +182,19 @@ The `file_path` here should be the `id` returned by the `/v1/dataprep/get_file` API.
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "https://www.ces.tech/.txt"}' \
http://localhost:6010/v1/dataprep/delete_file
http://localhost:6007/v1/dataprep/delete_file

# delete file
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "uploaded_file_1.txt"}' \
http://localhost:6010/v1/dataprep/delete_file
http://localhost:6007/v1/dataprep/delete_file

# delete all files and links, will drop the entire db collection
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "all"}' \
http://localhost:6010/v1/dataprep/delete_file
http://localhost:6007/v1/dataprep/delete_file
```
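The three delete modes above can be sketched as a dispatch on `file_path` (assumed behavior inferred from the examples; the actual endpoint logic lives in `prepare_doc_milvus.py`):

```python
def classify_delete_target(file_path: str) -> str:
    """Illustrative mapping of file_path values to delete actions,
    inferred from the README examples (not the service's actual code)."""
    if file_path == "all":
        return "drop_collection"   # removes every file and link
    if file_path.startswith(("http://", "https://")):
        return "delete_link"       # links are addressed as "<url>.txt"
    return "delete_file"           # a previously uploaded file name

print(classify_delete_target("uploaded_file_1.txt"))
```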

## 🚀4. Troubleshooting
@@ -248,5 +206,5 @@ curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file.pdf" \
-F "chunk_size=500" \
http://localhost:6010/v1/dataprep
http://localhost:6007/v1/dataprep
```
47 changes: 41 additions & 6 deletions comps/dataprep/milvus/langchain/docker-compose.yml
@@ -1,8 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: '3.5'

services:
etcd:
container_name: milvus-etcd
@@ -13,7 +11,7 @@ services:
- ETCD_QUOTA_BACKEND_BYTES=4294967296
- ETCD_SNAPSHOT_COUNT=50000
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
- ./volumes/etcd:/etcd
command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
healthcheck:
test: ["CMD", "etcdctl", "endpoint", "health"]
@@ -31,7 +29,7 @@ services:
- "5044:9001"
- "5043:9000"
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
- ./volumes/minio:/minio_data
command: minio server /minio_data --console-address ":9001"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
@@ -49,8 +47,8 @@ services:
ETCD_ENDPOINTS: etcd:2379
MINIO_ADDRESS: minio:9000
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
- ${DOCKER_VOLUME_DIRECTORY:-.}/milvus.yaml:/milvus/configs/milvus.yaml
- ./volumes/milvus:/var/lib/milvus
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
interval: 30s
@@ -64,6 +62,43 @@ services:
- "etcd"
- "minio"

dataprep-milvus-service:
image: ${REGISTRY:-opea}/dataprep-milvus:${TAG:-latest}
build:
context: ../../../..
dockerfile: ./comps/dataprep/milvus/langchain/Dockerfile
args:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
container_name: test-comps-dataprep-milvus-server
ports:
- "6007:6007"
depends_on:
- standalone
- tei-embedding-service
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
MILVUS_HOST: ${MILVUS_HOST}
MILVUS_PORT: ${MILVUS_PORT}
TEI_EMBEDDING_ENDPOINT: http://tei-embedding-service:80
HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}

tei-embedding-service:
image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
container_name: tei-embedding-server
ports:
- "6006:80"
volumes:
- "./data:/data"
shm_size: 1g
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
command: --model-id ${EMBEDDING_MODEL_ID} --auto-truncate

networks:
default:
name: milvus
driver: bridge
18 changes: 11 additions & 7 deletions comps/dataprep/milvus/langchain/prepare_doc_milvus.py
@@ -88,11 +88,12 @@ def ingest_chunks_to_milvus(file_name: str, chunks: List):
batch_docs = insert_docs[i : i + batch_size]

try:
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
_ = Milvus.from_documents(
batch_docs,
embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
partition_key_field=partition_field_name,
)
except Exception as e:
@@ -190,7 +191,7 @@ def delete_by_partition_field(my_milvus, partition_field):
logger.info(f"[ delete partition ] delete success: {res}")


@register_microservice(name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep", host="0.0.0.0", port=6010)
@register_microservice(name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep", host="0.0.0.0", port=6007)
async def ingest_documents(
files: Optional[Union[UploadFile, List[UploadFile]]] = File(None),
link_list: Optional[str] = Form(None),
@@ -207,10 +208,11 @@ raise HTTPException(status_code=400, detail="Provide either a file or a string list, not both.")
raise HTTPException(status_code=400, detail="Provide either a file or a string list, not both.")

# define Milvus obj
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
my_milvus = Milvus(
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
index_params=index_params,
auto_id=True,
)
@@ -336,17 +338,18 @@ async def ingest_documents(


@register_microservice(
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/get_file", host="0.0.0.0", port=6010
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/get_file", host="0.0.0.0", port=6007
)
async def rag_get_file_structure():
if logflag:
logger.info("[ get ] start to get file structure")

# define Milvus obj
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
my_milvus = Milvus(
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
index_params=index_params,
auto_id=True,
)
@@ -388,7 +391,7 @@ async def rag_get_file_structure():


@register_microservice(
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/delete_file", host="0.0.0.0", port=6010
name="opea_service@prepare_doc_milvus", endpoint="/v1/dataprep/delete_file", host="0.0.0.0", port=6007
)
async def delete_single_file(file_path: str = Body(..., embed=True)):
"""Delete file according to `file_path`.
@@ -401,10 +404,11 @@ async def delete_single_file(file_path: str = Body(..., embed=True)):
logger.info(file_path)

# define Milvus obj
url = "http://" + str(MILVUS_HOST) + ":" + str(MILVUS_PORT)
my_milvus = Milvus(
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"uri": milvus_uri},
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "uri": url},
index_params=index_params,
auto_id=True,
)