Data prep microservice for multi modal RAG #1

Open
wants to merge 89 commits into
base: main
16a6645
add BridgeTowerEmbeddings.py
siddhivelankar23 Jul 9, 2024
304bcd3
add bridgetower_custom.py
siddhivelankar23 Jul 9, 2024
afc5014
add file
siddhivelankar23 Jul 9, 2024
0a22d6e
multimodal embedding microservice
siddhivelankar23 Jul 10, 2024
4aa991f
add class ImageDoc
siddhivelankar23 Jul 10, 2024
1e48b6d
Merge branch 'opea-project:main' into main
siddhivelankar23 Jul 12, 2024
5ec7d39
add Dockerfile
siddhivelankar23 Jul 12, 2024
7437fe4
add docker compose
siddhivelankar23 Jul 12, 2024
efe3753
add text + image doc
siddhivelankar23 Jul 16, 2024
8bd9192
add TextImageDoc
siddhivelankar23 Jul 16, 2024
3863de3
add mm_retriever_redis.py
siddhivelankar23 Jul 16, 2024
768172e
add ValueError for invalid input
siddhivelankar23 Jul 16, 2024
b426aef
add imports
siddhivelankar23 Jul 16, 2024
55bdd42
format
siddhivelankar23 Jul 16, 2024
f3f9380
Merge branch 'opea-project:main' into main
siddhivelankar23 Jul 16, 2024
a1a8d48
Merge branch 'opea-project:main' into main
siddhivelankar23 Jul 18, 2024
27dc494
add SearchedMultimodalDoc class
siddhivelankar23 Jul 19, 2024
eb49cae
Update mm_retriever_redis.py
siddhivelankar23 Jul 19, 2024
43ee30d
Merge branch 'opea-project:main' into main
siddhivelankar23 Jul 22, 2024
c3b268d
basic utils for transcript extraction
sjagtap1803 Jul 27, 2024
a5469cb
basic utils for creating frames and annotations
sjagtap1803 Jul 27, 2024
a27500e
removed unused variables and functions
sjagtap1803 Jul 27, 2024
461c204
moved multimodal utils to separate file
sjagtap1803 Jul 28, 2024
93aa7c5
added redis schema and requirements
sjagtap1803 Jul 28, 2024
5fb35d9
included bridgetower embeddings classes in multimodal utils
sjagtap1803 Jul 28, 2024
1a82fd0
set up config for multimodal
sjagtap1803 Jul 28, 2024
7c5d784
defined microservice endpoints
sjagtap1803 Jul 29, 2024
c9bbf36
update requirements and add docker dir
sjagtap1803 Jul 29, 2024
7f319a5
renamed files
sjagtap1803 Jul 29, 2024
9e5f438
fixed some bugs
sjagtap1803 Jul 29, 2024
3aa6a20
fixed ingest files endpoint
sjagtap1803 Jul 29, 2024
e4a98b5
moved bridgetower classes to prepare_doc_redis.py
sjagtap1803 Jul 29, 2024
9822424
Merge branch 'opea-project:main' into main
siddhivelankar23 Jul 30, 2024
f4acedf
BridgeTowerEmbeddings to MMEmbeddings
siddhivelankar23 Jul 30, 2024
bdc2e80
add metadata
siddhivelankar23 Jul 30, 2024
d4f69fd
fixed embeddings initialization issue
sjagtap1803 Jul 30, 2024
da3d4fb
renamed embeddings variable
sjagtap1803 Jul 30, 2024
70a54b9
Rename BridgeTowerEmbeddings.py to MMEmbeddings.py
siddhivelankar23 Jul 31, 2024
4880393
update embeddings
siddhivelankar23 Jul 31, 2024
02bcf13
defined separate endpoints for videos and videos with captions
sjagtap1803 Aug 5, 2024
cf5e8df
update docker compose
sjagtap1803 Aug 5, 2024
afe684c
fixed seg faults on server
sjagtap1803 Aug 5, 2024
58f55bc
fixed bugs in transcript generation
sjagtap1803 Aug 5, 2024
2b46d33
Merge branch 'opea-project:main' into main
siddhivelankar23 Aug 6, 2024
987437a
move mm_embedding.py
siddhivelankar23 Aug 6, 2024
a35c7f0
Delete comps/embeddings/langchain/mm_embedding.py
siddhivelankar23 Aug 6, 2024
e4220ed
move Dockerfile
siddhivelankar23 Aug 6, 2024
8fabe05
move docker_compose_embedding.yaml
siddhivelankar23 Aug 6, 2024
4e8a794
Delete comps/embeddings/langchain/mm_docker directory
siddhivelankar23 Aug 6, 2024
244bd86
add __init__.py
siddhivelankar23 Aug 6, 2024
7742aa3
add requirements.txt
siddhivelankar23 Aug 6, 2024
8313588
change paths
siddhivelankar23 Aug 6, 2024
4d7e974
remove unwanted variables
siddhivelankar23 Aug 6, 2024
d868728
move MMEmbeddings
siddhivelankar23 Aug 6, 2024
667f601
move custom embeddings file
siddhivelankar23 Aug 6, 2024
b922e9b
Delete comps/embeddings/langchain/BridgeTowerCustom directory
siddhivelankar23 Aug 6, 2024
3654c80
Delete comps/embeddings/langchain/MMEmbeddings.py
siddhivelankar23 Aug 6, 2024
3b22454
add model name
siddhivelankar23 Aug 6, 2024
059164b
Update mm_embedding.py
siddhivelankar23 Aug 6, 2024
bc20a97
change struct
siddhivelankar23 Aug 6, 2024
1272125
Merge branch 'opea-project:main' into main
siddhivelankar23 Aug 6, 2024
de3f844
Merge branch 'opea-project:main' into main
siddhivelankar23 Aug 6, 2024
dc114a1
update config and model initialization
sjagtap1803 Aug 6, 2024
4a14096
generate captions using lvm microservice
sjagtap1803 Aug 6, 2024
abbbe93
changed llava prompt and endpoint variables
sjagtap1803 Aug 6, 2024
35bdd02
update Dockerfile with ffmpeg build
sjagtap1803 Aug 6, 2024
fb72026
made minor changes in entrypoint script
sjagtap1803 Aug 6, 2024
7791c43
run embeddings model on cpu for stability
sjagtap1803 Aug 7, 2024
326b337
fixed paths in Dockerfile
sjagtap1803 Aug 7, 2024
fe9fdec
fixed bug in delete endpoint
sjagtap1803 Aug 7, 2024
bc8ae4b
Merge remote-tracking branch 'upstream/main' into sjagtap-data-prep
sjagtap1803 Aug 7, 2024
94900c8
renamed files
sjagtap1803 Aug 7, 2024
7663f38
removed lvm microservice from docker compose
sjagtap1803 Aug 7, 2024
73d30f9
update annotations schema and delete frames and annotations json
sjagtap1803 Aug 8, 2024
10276b9
Merge remote-tracking branch 'upstream/main' into sjagtap-data-prep
sjagtap1803 Aug 8, 2024
0f1f88b
raise http exceptions for invalid inputs
sjagtap1803 Aug 8, 2024
29c7bf7
update annotations and metadate fields
sjagtap1803 Aug 8, 2024
f9ecd99
fixed some minor bugs
sjagtap1803 Aug 8, 2024
1e63e5a
Merge branch 'opea-project:main' into main
siddhivelankar23 Aug 9, 2024
2305c87
fixed redis schema yaml
sjagtap1803 Aug 9, 2024
af89703
Merge remote-tracking branch 'upstream/main' into sjagtap-data-prep
sjagtap1803 Aug 9, 2024
c16d9ce
Merge remote-tracking branch 'origin/main' into sjagtap-data-prep
sjagtap1803 Aug 9, 2024
e73d007
Delete comps/embeddings/langchain_multimodal directory
siddhivelankar23 Aug 9, 2024
971a9ab
Delete comps/retrievers/langchain/redis/mm_retriever_redis.py
siddhivelankar23 Aug 9, 2024
d4a8cf0
sync with https://github.com/opea-project/GenAIComps.git
siddhivelankar23 Aug 9, 2024
dcc67d7
Merge remote-tracking branch 'upstream/main' into sjagtap-data-prep
sjagtap1803 Aug 13, 2024
40abe27
use single endpoint for all endpoints
sjagtap1803 Aug 13, 2024
cf4bda4
created README for multimodal langchain dataprep with redis
sjagtap1803 Aug 13, 2024
02cbf18
Merge branch 'main' into sjagtap-data-prep
tileintel Aug 23, 2024
469 changes: 469 additions & 0 deletions comps/dataprep/multimodal_utils.py

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion comps/dataprep/redis/README.md
@@ -4,7 +4,9 @@ For dataprep microservice, we provide two frameworks: `Langchain` and `LlamaInde

We organized these two folders in the same way, so you can use either framework for dataprep microservice with the following constructions.

## 🚀1. Start Microservice with Python(Option 1)
Instructions for multimodal data preparation can be found in the `multimodal_langchain` directory.

# 🚀1. Start Microservice with Python(Option 1)

### 1.1 Install Requirements

190 changes: 190 additions & 0 deletions comps/dataprep/redis/multimodal_langchain/README.md
@@ -0,0 +1,190 @@
# Dataprep Microservice for Multimodal Data with Redis

This dataprep microservice accepts videos (MP4 files) and ingests their data into a Redis vector store with the help of transcripts and captions.

For videos without audio or recognizable speech, the LVM generates captions for video frames. To use the LVM, start the LVM microservice before starting this one by following this [readme](../../../lvms/README.md).

# 🚀1. Start Microservice with Python (Option 1)

## 1.1 Install Requirements

```bash
apt update
apt install default-jre

# Install ffmpeg static build
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
mkdir ffmpeg-git-amd64-static
tar -xvf ffmpeg-git-amd64-static.tar.xz -C ffmpeg-git-amd64-static --strip-components 1
export PATH=$(pwd)/ffmpeg-git-amd64-static:$PATH
cp $(pwd)/ffmpeg-git-amd64-static/ffmpeg /usr/local/bin/

pip install -r requirements.txt
```

## 1.2 Start Redis Stack Server

Please refer to this [readme](../../../vectorstores/langchain/redis/README.md).

## 1.3 Set Up Environment Variables

```bash
export REDIS_URL="redis://${your_ip}:6379"
export INDEX_NAME=${your_index_name}
export PYTHONPATH=${path_to_comps}
```
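
The `config.py` added in this PR gives an explicit `REDIS_URL` precedence; when it is unset, the connection string is assembled from `REDIS_HOST`, `REDIS_PORT`, `REDIS_SSL`, `REDIS_USERNAME`, and `REDIS_PASSWORD`. A minimal sketch of that precedence follows. It takes the environment as a plain dict so it can be tried without touching real variables; the name `resolve_redis_url` and the dict parameter are illustrative and not part of the service.

```python
def resolve_redis_url(env):
    """Mirror the precedence in config.py: REDIS_URL first, else build from parts."""
    url = env.get("REDIS_URL")
    if url:
        return url

    # REDIS_SSL accepts the same truthy spellings as config.py's boolean helper.
    use_ssl = env.get("REDIS_SSL", "").lower() in {"true", "1", "t", "y", "yes"}
    scheme = "rediss://" if use_ssl else "redis://"

    # Credentials are only embedded when a password is set.
    auth = ""
    password = env.get("REDIS_PASSWORD")
    if password is not None:
        auth = f"{env.get('REDIS_USERNAME', 'default')}:{password}@"

    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", 6379))
    return f"{scheme}{auth}{host}:{port}"


print(resolve_redis_url({"REDIS_URL": "redis://10.0.0.5:6379"}))
print(resolve_redis_url({"REDIS_HOST": "db", "REDIS_SSL": "yes", "REDIS_PASSWORD": "s3cret"}))
```

In practice this means exporting `REDIS_URL` as shown above is sufficient; the individual variables only matter when it is absent.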

## 1.4 Start LVM Microservice

Please refer to this [readme](../../../lvms/README.md) to start the LVM microservice.

After the LVM microservice is up, set its endpoint.

```bash
export LVM_ENDPOINT="http://localhost:9399/v1/lvm"
```

## 1.5 Start Document Preparation Microservice for Redis with Python Script

Start the document preparation microservice for Redis with the command below.

```bash
python prepare_videodoc_redis.py
```

# 🚀2. Start Microservice with Docker (Option 2)

## 2.1 Start Redis Stack Server

Please refer to this [readme](../../../vectorstores/langchain/redis/README.md).

## 2.2 Set Up Environment Variables

```bash
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_ENDPOINT="http://${your_ip}:9399/v1/lvm"
export REDIS_URL="redis://${your_ip}:6379"
export INDEX_NAME=${your_index_name}
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
```

## 2.3 Build Docker Image

```bash
# Move to the repository root (four levels up from comps/dataprep/redis/multimodal_langchain)
cd ../../../../
docker build -t opea/dataprep-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/redis/multimodal_langchain/docker/Dockerfile .
```

## 2.4 Run Docker with CLI (Option A)

```bash
docker run -d --name="dataprep-redis-server" -p 6007:6007 --runtime=runc --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e REDIS_URL=$REDIS_URL -e INDEX_NAME=$INDEX_NAME -e LVM_ENDPOINT=$LVM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN opea/dataprep-redis:latest
```

## 2.5 Run with Docker Compose (Option B - deprecated, will move to GenAIExamples in future)

```bash
cd comps/dataprep/redis/multimodal_langchain/docker
docker compose -f docker-compose-dataprep-redis.yaml up -d
```

# 🚀3. Check Microservice Status

```bash
docker container logs -f dataprep-redis-server
```

# 🚀4. Consume Microservice

## 4.1 Consume videos_with_transcripts API

Once the document preparation microservice for Redis is running, use the commands below to convert videos and their transcripts to embeddings and save them to the database.

Make sure the file path after `files=@` is correct.

- Single video-transcript pair upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video1.vtt" \
http://dataprep-redis-service:6007/v1/dataprep/videos_with_transcripts
```

- Multiple video-transcript pair upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video1.vtt" \
-F "files=@./video2.mp4" \
-F "files=@./video2.vtt" \
http://dataprep-redis-service:6007/v1/dataprep/videos_with_transcripts
```
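
Because this endpoint expects each video to arrive with its transcript, a client-side check before uploading can catch a missing file early. The sketch below pairs files by base name. That the service matches pairs by base name is an assumption here, the `.mp4`/`.vtt` extensions are taken from the examples above, and `pair_videos_with_transcripts` is a hypothetical helper, not part of the microservice.

```python
from pathlib import PurePath


def pair_videos_with_transcripts(paths):
    """Return [(video, transcript), ...] sorted by base name; raise on unmatched files."""
    videos, transcripts = {}, {}
    for p in map(PurePath, paths):
        suffix = p.suffix.lower()
        if suffix == ".mp4":
            videos[p.stem] = str(p)
        elif suffix == ".vtt":
            transcripts[p.stem] = str(p)
        else:
            raise ValueError(f"unsupported file type: {p}")

    # Any base name present on only one side is an incomplete pair.
    unmatched = sorted(set(videos) ^ set(transcripts))
    if unmatched:
        raise ValueError(f"files without a matching partner: {unmatched}")

    return [(videos[name], transcripts[name]) for name in sorted(videos)]


pairs = pair_videos_with_transcripts(["video2.vtt", "video1.mp4", "video1.vtt", "video2.mp4"])
print(pairs)
```

Each returned pair can then be passed as two consecutive `files=@` fields, as in the `curl` commands.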

## 4.2 Consume generate_transcripts API

If transcripts are not available for the videos, the service extracts them from the videos themselves. Use the commands below to convert videos and their extracted transcripts to embeddings and save them to the database.

- Single video upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
http://dataprep-redis-service:6007/v1/dataprep/generate_transcripts
```

- Multiple video upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video2.mp4" \
http://dataprep-redis-service:6007/v1/dataprep/generate_transcripts
```
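
For clients that cannot shell out to `curl`, the same upload body can be built in Python. The sketch below encodes a repeated `files` form field as `multipart/form-data` using only the standard library. The helper name `encode_files` is hypothetical, the field name `files` follows the `curl` examples, and the per-part content type `application/octet-stream` is an assumption rather than a documented requirement of the service.

```python
import uuid


def encode_files(files, field="files"):
    """Encode {filename: bytes} as a multipart/form-data body.

    Returns (body, content_type); pass content_type as the Content-Type header.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for filename, content in files.items():
        chunks.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
                "Content-Type: application/octet-stream\r\n\r\n"
            ).encode("utf-8")
        )
        chunks.append(content)
        chunks.append(b"\r\n")
    # The closing delimiter carries two trailing dashes.
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


body, content_type = encode_files({"video1.mp4": b"\x00\x01fake-bytes"})
print(content_type)
print(len(body), "bytes")
```

The result can be sent with `urllib.request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")`.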

## 4.3 Consume generate_captions API

If uploaded videos lack audio or recognizable speech, the service generates captions for video frames using the LVM. Use the commands below to convert videos and their generated captions to embeddings and save them to the database.

- Single video upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
http://dataprep-redis-service:6007/v1/dataprep/generate_captions
```

- Multiple video upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video2.mp4" \
http://dataprep-redis-service:6007/v1/dataprep/generate_captions
```

## 4.4 Consume get_videos API

To get names of uploaded videos, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://dataprep-redis-service:6007/v1/dataprep/get_videos
```

## 4.5 Consume delete_videos API

To delete uploaded videos and clear the database, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://dataprep-redis-service:6007/v1/dataprep/delete_videos
```
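
The `get_videos` and `delete_videos` calls above are POST requests with no payload. A minimal standard-library sketch that builds, but does not send, such a request is shown below. `dataprep_request` is a hypothetical helper, and the default base URL is simply the one used in the `curl` examples.

```python
import urllib.request


def dataprep_request(endpoint, base_url="http://dataprep-redis-service:6007"):
    """Build a POST request for a /v1/dataprep/<endpoint> route without sending it."""
    return urllib.request.Request(
        f"{base_url}/v1/dataprep/{endpoint}",
        data=b"",  # empty body, since the curl calls send no payload
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = dataprep_request("get_videos")
print(req.get_method(), req.full_url)
# To actually call the service: urllib.request.urlopen(req, timeout=30).read()
```

Keeping request construction separate from sending makes the URL and headers easy to inspect before pointing the client at a live deployment.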
2 changes: 2 additions & 0 deletions comps/dataprep/redis/multimodal_langchain/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
71 changes: 71 additions & 0 deletions comps/dataprep/redis/multimodal_langchain/config.py
@@ -0,0 +1,71 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

# Models
EMBED_MODEL = os.getenv("EMBED_MODEL", "BridgeTower/bridgetower-large-itm-mlm-itc")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "large-v2")

# Redis Connection Information
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = int(os.getenv("REDIS_PORT", 6379))

# LVM Microservice Information
LVM_ENDPOINT = os.getenv("LVM_ENDPOINT", "http://localhost:9399/v1/lvm")


def get_boolean_env_var(var_name, default_value=False):
    """Retrieve the boolean value of an environment variable.

    Args:
        var_name (str): The name of the environment variable to retrieve.
        default_value (bool): The default value to return if the variable
            is not found.

    Returns:
        bool: The value of the environment variable, interpreted as a boolean.
    """
    true_values = {"true", "1", "t", "y", "yes"}
    false_values = {"false", "0", "f", "n", "no"}

    # Retrieve the environment variable's value
    value = os.getenv(var_name, "").lower()

    # Decide the boolean value based on the content of the string
    if value in true_values:
        return True
    elif value in false_values:
        return False
    else:
        return default_value


def format_redis_conn_from_env():
    redis_url = os.getenv("REDIS_URL", None)
    if redis_url:
        return redis_url
    else:
        using_ssl = get_boolean_env_var("REDIS_SSL", False)
        start = "rediss://" if using_ssl else "redis://"

        # if using RBAC
        password = os.getenv("REDIS_PASSWORD", None)
        username = os.getenv("REDIS_USERNAME", "default")
        if password is not None:
            start += f"{username}:{password}@"

        return start + f"{REDIS_HOST}:{REDIS_PORT}"


REDIS_URL = format_redis_conn_from_env()

# Vector Index Configuration
INDEX_NAME = os.getenv("INDEX_NAME", "mm-rag-redis")

current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)
REDIS_SCHEMA = os.getenv("REDIS_SCHEMA", "schema.yml")
TIMEOUT_SECONDS = int(os.getenv("TIMEOUT_SECONDS", 600))
schema_path = os.path.join(parent_dir, REDIS_SCHEMA)
INDEX_SCHEMA = schema_path
48 changes: 48 additions & 0 deletions comps/dataprep/redis/multimodal_langchain/docker/Dockerfile
@@ -0,0 +1,48 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG=C.UTF-8

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libgl1-mesa-glx \
libjemalloc-dev \
default-jre \
wget \
vim

# Install ffmpeg static build
RUN cd /root && wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz && \
mkdir ffmpeg-git-amd64-static && tar -xvf ffmpeg-git-amd64-static.tar.xz -C ffmpeg-git-amd64-static --strip-components 1 && \
export PATH=/root/ffmpeg-git-amd64-static:$PATH && \
cp /root/ffmpeg-git-amd64-static/ffmpeg /usr/local/bin/

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/redis/multimodal_langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

USER root

RUN mkdir -p /home/user/comps/dataprep/redis/multimodal_langchain/uploaded_files && chown -R user /home/user/comps/dataprep/redis/multimodal_langchain/uploaded_files

USER user

WORKDIR /home/user/comps/dataprep/redis/multimodal_langchain

ENTRYPOINT ["python", "prepare_videodoc_redis.py"]

Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
  redis-vector-db:
    image: redis/redis-stack:7.2.0-v9
    container_name: redis-vector-db
    ports:
      - "6379:6379"
      - "8001:8001"
  dataprep-redis:
    image: opea/dataprep-redis:latest
    container_name: dataprep-redis-server
    ports:
      - "6007:6007"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      REDIS_URL: ${REDIS_URL}
      INDEX_NAME: ${INDEX_NAME}
      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
      LVM_ENDPOINT: ${LVM_ENDPOINT}
    restart: unless-stopped

networks:
  default:
    driver: bridge