siddhivelankar23 · sjagtap1803 · Jul 9, 2024 · Jul 9, 2024 · Jul 9, 2024 · Jul 10, 2024
diff --git a/comps/dataprep/multimodal_utils.py b/comps/dataprep/multimodal_utils.py
diff --git a/comps/dataprep/redis/README.md b/comps/dataprep/redis/README.md
@@ -4,7 +4,9 @@ For dataprep microservice, we provide two frameworks: `Langchain` and `LlamaInde
 
 We organized these two folders in the same way, so you can use either framework for dataprep microservice with the following constructions.
 
-## 🚀1. Start Microservice with Python（Option 1）
+Instructions for multimodal data preparation can be found in the `multimodal_langchain` directory.
+
+# 🚀1. Start Microservice with Python（Option 1）
 
 ### 1.1 Install Requirements
 

diff --git a/comps/dataprep/redis/multimodal_langchain/README.md b/comps/dataprep/redis/multimodal_langchain/README.md
@@ -0,0 +1,190 @@
+# Dataprep Microservice for Multimodal Data with Redis
+
+This dataprep microservice accepts videos (mp4 files) from the user and ingests their data into a Redis vector store, using transcripts and captions.
+
+For videos without audio or recognizable speech, the LVM microservice generates captions for video frames. To use it, follow this [readme](../../../lvms/README.md) to start the LVM microservice before starting this one.
+
+# 🚀1. Start Microservice with Python（Option 1）
+
+## 1.1 Install Requirements
+
+```bash
+apt update
+apt install default-jre
+
+# Install ffmpeg static build
+wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
+mkdir ffmpeg-git-amd64-static
+tar -xvf ffmpeg-git-amd64-static.tar.xz -C ffmpeg-git-amd64-static --strip-components 1
+export PATH=$(pwd)/ffmpeg-git-amd64-static:$PATH
+cp $(pwd)/ffmpeg-git-amd64-static/ffmpeg /usr/local/bin/
+
+pip install -r requirements.txt
+```
+
+## 1.2 Start Redis Stack Server
+
+Please refer to this [readme](../../../vectorstores/langchain/redis/README.md).
+
+## 1.3 Setup Environment Variables
+
+```bash
+export REDIS_URL="redis://${your_ip}:6379"
+export INDEX_NAME=${your_index_name}
+export PYTHONPATH=${path_to_comps}
+```
+
+## 1.4 Start LVM Microservice
+
+Please refer to this [readme](../../../lvms/README.md) to start the LVM microservice.
+
+After the LVM microservice is up, set its endpoint as an environment variable.
+
+```bash
+export LVM_ENDPOINT="http://localhost:9399/v1/lvm"
+```
+
+## 1.5 Start Document Preparation Microservice for Redis with Python Script
+
+Start the document preparation microservice for Redis with the command below.
+
+```bash
+python prepare_videodoc_redis.py
+```
+
+# 🚀2. Start Microservice with Docker (Option 2)
+
+## 2.1 Start Redis Stack Server
+
+Please refer to this [readme](../../../vectorstores/langchain/redis/README.md).
+
+## 2.2 Setup Environment Variables
+
+```bash
+export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
+export LVM_ENDPOINT="http://${your_ip}:9399/v1/lvm"
+export REDIS_URL="redis://${your_ip}:6379"
+export INDEX_NAME=${your_index_name}
+export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+```
+
+## 2.3 Build Docker Image
+
+```bash
+cd ../../../../
+docker build -t opea/dataprep-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/redis/multimodal_langchain/docker/Dockerfile .
+```
+
+## 2.4 Run Docker with CLI (Option A)
+
+```bash
+docker run -d --name="dataprep-redis-server" -p 6007:6007 --runtime=runc --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e REDIS_URL=$REDIS_URL -e INDEX_NAME=$INDEX_NAME -e LVM_ENDPOINT=$LVM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN opea/dataprep-redis:latest
+```
+
+## 2.5 Run with Docker Compose (Option B - deprecated, will move to genAIExample in future)
+
+```bash
+cd comps/dataprep/redis/multimodal_langchain/docker
+docker compose -f docker-compose-dataprep-redis.yaml up -d
+```
+
+# 🚀3. Check Microservice Status
+
+```bash
+docker container logs -f dataprep-redis-server
+```
+
+# 🚀4. Consume Microservice
+
+## 4.1 Consume videos_with_transcripts API
+
+Once the document preparation microservice for Redis has started, use the commands below to convert videos and their transcripts to embeddings and save them to the database.
+
+Make sure the file path after `files=@` is correct.
+
+- Single video-transcript pair upload
+
+```bash
+curl -X POST \
+    -H "Content-Type: multipart/form-data" \
+    -F "files=@./video1.mp4" \
+    -F "files=@./video1.vtt" \
+    http://dataprep-redis-service:6007/v1/dataprep/videos_with_transcripts
+```
+
+- Multiple video-transcript pair upload
+```bash
+curl -X POST \
+    -H "Content-Type: multipart/form-data" \
+    -F "files=@./video1.mp4" \
+    -F "files=@./video1.vtt" \
+    -F "files=@./video2.mp4" \
+    -F "files=@./video2.vtt" \
+    http://dataprep-redis-service:6007/v1/dataprep/videos_with_transcripts
+```
+
+## 4.2 Consume generate_transcripts API
+
+If transcripts are not available for the videos, the service extracts them from the videos. Use the commands below to convert videos and their extracted transcripts to embeddings and save them to the database.
+
+- Single video upload
+
+```bash
+curl -X POST \
+    -H "Content-Type: multipart/form-data" \
+    -F "files=@./video1.mp4" \
+    http://dataprep-redis-service:6007/v1/dataprep/generate_transcripts
+```
+
+- Multiple video upload
+
+```bash
+curl -X POST \
+    -H "Content-Type: multipart/form-data" \
+    -F "files=@./video1.mp4" \
+    -F "files=@./video2.mp4" \
+    http://dataprep-redis-service:6007/v1/dataprep/generate_transcripts
+```
+
+## 4.3 Consume generate_captions API
+
+If the uploaded videos lack audio or recognizable speech, captions are generated for their frames using LVM. Use the commands below to convert videos and their generated captions to embeddings and save them to the database.
+
+- Single video upload
+
+```bash
+curl -X POST \
+    -H "Content-Type: multipart/form-data" \
+    -F "files=@./video1.mp4" \
+    http://dataprep-redis-service:6007/v1/dataprep/generate_captions
+```
+
+- Multiple video upload
+
+```bash
+curl -X POST \
+    -H "Content-Type: multipart/form-data" \
+    -F "files=@./video1.mp4" \
+    -F "files=@./video2.mp4" \
+    http://dataprep-redis-service:6007/v1/dataprep/generate_captions
+```
+
+## 4.4 Consume get_videos API
+
+To get the names of uploaded videos, use the following command.
+
+```bash
+curl -X POST \
+    -H "Content-Type: application/json" \
+    http://dataprep-redis-service:6007/v1/dataprep/get_videos
+```
+
+## 4.5 Consume delete_videos API
+
+To delete uploaded videos and clear the database, use the following command.
+
+```bash
+curl -X POST \
+    -H "Content-Type: application/json" \
+    http://dataprep-redis-service:6007/v1/dataprep/delete_videos
+```
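Reviewer note: the upload examples in section 4 of this README all follow one pattern: one `files=@...` form field per file, posted to `/v1/dataprep/<endpoint>`. The sketch below captures that pattern in Python. The helper names `pair_videos_with_transcripts` and `build_upload_command` are hypothetical and not part of this PR, and matching a transcript to a video by shared file stem is an assumption drawn from the `video1.mp4` / `video1.vtt` naming in the examples, not a documented rule of the service.

```python
from pathlib import Path


def pair_videos_with_transcripts(paths):
    """Order files so each .mp4 is immediately followed by its same-stem .vtt.

    Assumption (not confirmed by the PR): a transcript belongs to the video
    that shares its path and file stem. Raises ValueError when a video has
    no matching transcript in the input.
    """
    by_path = {Path(p): p for p in paths}
    ordered = []
    for p in paths:
        path = Path(p)
        if path.suffix != ".mp4":
            continue  # transcripts are pulled in alongside their video
        transcript = by_path.get(path.with_suffix(".vtt"))
        if transcript is None:
            raise ValueError(f"no transcript found for {p}")
        ordered += [p, transcript]
    return ordered


def build_upload_command(base_url, endpoint, files):
    """Return the curl argv that mirrors the README's multipart upload examples."""
    cmd = ["curl", "-X", "POST", "-H", "Content-Type: multipart/form-data"]
    for f in files:
        cmd += ["-F", f"files=@{f}"]  # one form field per uploaded file
    cmd.append(f"{base_url.rstrip('/')}/v1/dataprep/{endpoint}")
    return cmd
```

The returned list can be passed to `subprocess.run` as-is. The same `build_upload_command` covers `generate_transcripts` and `generate_captions` by passing only `.mp4` files and the matching endpoint name.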
diff --git a/comps/dataprep/redis/multimodal_langchain/__init__.py b/comps/dataprep/redis/multimodal_langchain/__init__.py
@@ -0,0 +1,2 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
diff --git a/comps/dataprep/redis/multimodal_langchain/config.py b/comps/dataprep/redis/multimodal_langchain/config.py
@@ -0,0 +1,71 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import os
+
+# Models
+EMBED_MODEL = os.getenv("EMBED_MODEL", "BridgeTower/bridgetower-large-itm-mlm-itc")
+WHISPER_MODEL = os.getenv("WHISPER_MODEL", "large-v2")
+
+# Redis Connection Information
+REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
+REDIS_PORT = int(os.getenv("REDIS_PORT", 6379))
+
+# LVM Microservice Information
+LVM_ENDPOINT = os.getenv("LVM_ENDPOINT", "http://localhost:9399/v1/lvm")
+
+
+def get_boolean_env_var(var_name, default_value=False):
+    """Retrieve the boolean value of an environment variable.
+
+    Args:
+        var_name (str): The name of the environment variable to retrieve.
+        default_value (bool): The default value to return if the variable
+            is not found or is not a recognized boolean string.
+
+    Returns:
+        bool: The value of the environment variable, interpreted as a boolean.
+    """
+    true_values = {"true", "1", "t", "y", "yes"}
+    false_values = {"false", "0", "f", "n", "no"}
+
+    # Retrieve the environment variable's value
+    value = os.getenv(var_name, "").lower()
+
+    # Decide the boolean value based on the content of the string
+    if value in true_values:
+        return True
+    elif value in false_values:
+        return False
+    else:
+        return default_value
+
+
+def format_redis_conn_from_env():
+    redis_url = os.getenv("REDIS_URL", None)
+    if redis_url:
+        return redis_url
+    else:
+        using_ssl = get_boolean_env_var("REDIS_SSL", False)
+        start = "rediss://" if using_ssl else "redis://"
+
+        # if using RBAC
+        password = os.getenv("REDIS_PASSWORD", None)
+        username = os.getenv("REDIS_USERNAME", "default")
+        if password is not None:
+            start += f"{username}:{password}@"
+
+        return start + f"{REDIS_HOST}:{REDIS_PORT}"
+
+
+REDIS_URL = format_redis_conn_from_env()
+
+# Vector Index Configuration
+INDEX_NAME = os.getenv("INDEX_NAME", "mm-rag-redis")
+
+current_file_path = os.path.abspath(__file__)
+parent_dir = os.path.dirname(current_file_path)
+REDIS_SCHEMA = os.getenv("REDIS_SCHEMA", "schema.yml")
+TIMEOUT_SECONDS = int(os.getenv("TIMEOUT_SECONDS", 600))
+schema_path = os.path.join(parent_dir, REDIS_SCHEMA)
+INDEX_SCHEMA = schema_path
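Reviewer note: the precedence rules in `format_redis_conn_from_env` are easy to misread, so here is a minimal standalone sketch of them. `resolve_redis_url` is a hypothetical name; it takes a plain dict instead of reading `os.environ` so the behavior can be checked without touching the real environment. It is meant to mirror the logic in this hunk, not replace it.

```python
def resolve_redis_url(env):
    """Mirror the resolution order used in config.py, on a plain dict.

    1. A non-empty REDIS_URL wins outright.
    2. Otherwise the scheme comes from REDIS_SSL, credentials are added only
       when REDIS_PASSWORD is set, and host/port fall back to localhost:6379.
    """
    url = env.get("REDIS_URL")
    if url:
        return url

    # Same truthy strings as get_boolean_env_var; anything else means no SSL.
    truthy = {"true", "1", "t", "y", "yes"}
    scheme = "rediss://" if env.get("REDIS_SSL", "").lower() in truthy else "redis://"

    password = env.get("REDIS_PASSWORD")
    username = env.get("REDIS_USERNAME", "default")
    auth = f"{username}:{password}@" if password is not None else ""

    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", 6379))
    return f"{scheme}{auth}{host}:{port}"


print(resolve_redis_url({}))
# redis://localhost:6379
print(resolve_redis_url({"REDIS_SSL": "yes", "REDIS_PASSWORD": "s3cret"}))
# rediss://default:s3cret@localhost:6379
print(resolve_redis_url({"REDIS_URL": "redis://10.0.0.5:6379", "REDIS_SSL": "1"}))
# redis://10.0.0.5:6379
```

Note that when `REDIS_URL` is set, `REDIS_SSL`, `REDIS_USERNAME`, and `REDIS_PASSWORD` are ignored, as the last call shows.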
diff --git a/comps/dataprep/redis/multimodal_langchain/docker/Dockerfile b/comps/dataprep/redis/multimodal_langchain/docker/Dockerfile
@@ -0,0 +1,48 @@
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+FROM python:3.11-slim
+
+ENV LANG=C.UTF-8
+
+ARG ARCH="cpu"
+
+RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
+    build-essential \
+    libgl1-mesa-glx \
+    libjemalloc-dev \
+    default-jre \
+    wget \
+    vim
+
+# Install ffmpeg static build
+RUN cd /root && wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz && \
+    mkdir ffmpeg-git-amd64-static && tar -xvf ffmpeg-git-amd64-static.tar.xz -C ffmpeg-git-amd64-static --strip-components 1 && \
+    export PATH=/root/ffmpeg-git-amd64-static:$PATH && \
+    cp /root/ffmpeg-git-amd64-static/ffmpeg /usr/local/bin/
+
+RUN useradd -m -s /bin/bash user && \
+    mkdir -p /home/user && \
+    chown -R user /home/user/
+
+USER user
+
+COPY comps /home/user/comps
+
+RUN pip install --no-cache-dir --upgrade pip setuptools && \
+    if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
+    pip install --no-cache-dir -r /home/user/comps/dataprep/redis/multimodal_langchain/requirements.txt
+
+ENV PYTHONPATH=$PYTHONPATH:/home/user
+
+USER root
+
+RUN mkdir -p /home/user/comps/dataprep/redis/multimodal_langchain/uploaded_files && chown -R user /home/user/comps/dataprep/redis/multimodal_langchain/uploaded_files
+
+USER user
+
+WORKDIR /home/user/comps/dataprep/redis/multimodal_langchain
+
+ENTRYPOINT ["python", "prepare_videodoc_redis.py"]
+
diff --git a/comps/dataprep/redis/multimodal_langchain/docker/docker-compose-dataprep-redis.yaml b/comps/dataprep/redis/multimodal_langchain/docker/docker-compose-dataprep-redis.yaml
@@ -0,0 +1,30 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+version: "3"
+services:
+  redis-vector-db:
+    image: redis/redis-stack:7.2.0-v9
+    container_name: redis-vector-db
+    ports:
+      - "6379:6379"
+      - "8001:8001"
+  dataprep-redis:
+    image: opea/dataprep-redis:latest
+    container_name: dataprep-redis-server
+    ports:
+      - "6007:6007"
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      REDIS_URL: ${REDIS_URL}
+      INDEX_NAME: ${INDEX_NAME}
+      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
+      LVM_ENDPOINT: ${LVM_ENDPOINT}
+    restart: unless-stopped
+
+networks:
+  default:
+    driver: bridge
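Reviewer note: Docker Compose substitutes an empty string for any `${VAR}` that is unset in the shell, and calls such as `os.getenv("INDEX_NAME", "mm-rag-redis")` in `config.py` only fall back to their default when the variable is absent, not when it is empty. A small pre-flight check before `docker compose up` can catch this. The sketch below is an illustration only; `unset_or_empty` and `FORWARDED_VARS` are hypothetical names, and the choice of which variables to check is an assumption based on what the compose file forwards and `config.py` reads.

```python
import os

# Variables that the compose file forwards and config.py reads (assumed subset).
FORWARDED_VARS = ("REDIS_URL", "INDEX_NAME", "LVM_ENDPOINT")


def unset_or_empty(names=FORWARDED_VARS, env=None):
    """Return the names that are missing or empty in the environment mapping.

    `env` defaults to os.environ; pass a dict to check a different mapping.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `unset_or_empty()` in the shell session that will run Compose returns the variables that still need an `export`; an empty list means all three are set.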