Commit 62b6d8d

Merge branch 'intel:main' into main

XinyaoWa authored Jan 31, 2024 · 2 parents 8e670e0 + bfc914a

Showing 208 changed files with 11,269 additions and 2,161 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/e2eaiok_recdp_release_pypi.yml
@@ -0,0 +1,37 @@
name: Publish RecDP Stable Release to PyPI

on:
  workflow_dispatch:
  push:
    branches:
      - main

permissions:
  contents: read
  packages: write

jobs:
  e2eaiok-release-python-pypi:
    runs-on: self-hosted
    if: ${{ github.repository_owner == 'intel' }}
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.7'

      - name: Build Package
        run: |
          pip install build wheel
          release_version=$(cat e2eAIOK/version | head -1)
          cd RecDP
          echo $release_version > version
          python3 setup.py sdist --with_prefix
      - name: Upload Package
        uses: pypa/gh-action-pypi-publish@master
        with:
          password: ${{ secrets.PYPI_API_TOKEN_E2EAIOKRECDP }}
          packages_dir: RecDP/dist
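For reference, the Build Package step simply copies the first line of `e2eAIOK/version` into `RecDP/version` and then runs a source build. A minimal Python sketch of the same logic for a local dry run (a hypothetical helper, not part of the workflow; assumes you are at the repository root):

```
# Local dry-run equivalent of the "Build Package" step above (hypothetical helper).
from pathlib import Path
import subprocess

# The first line of e2eAIOK/version is the release version.
release_version = Path("e2eAIOK/version").read_text().splitlines()[0]
Path("RecDP/version").write_text(release_version + "\n")

# Build the sdist the same way the workflow does.
subprocess.run(["python3", "setup.py", "sdist", "--with_prefix"], cwd="RecDP", check=True)
```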
4 changes: 3 additions & 1 deletion README.md
@@ -28,7 +28,9 @@ Intel® End-to-End AI Optimization Kit is a composable toolkit for E2E AI optim

## The key components are

* [RecDP](RecDP/README.md): A parallel data processing and feature engineering lib on top of Spark, and extensible to other data processing tools. It provides abstraction API to hide Spark programming complexity, delivers optimized performance through adaptive query plan & strategy, supports critical feature engineering functions on Tabular dataset, and can be easily integrated to third party solutions.
* [RecDP](RecDP/README.md): A one-stop toolkit for AI data processing. It provides LLM data processing and machine learning feature engineering libraries in a scalable fashion on top of Ray and Spark. It offers a simple-to-use API for data scientists, delivers optimized performance, and can be easily integrated into third-party solutions.
* [Auto Feature Engineering](RecDP/pyrecdp/autofe/README.md): Provides an automatic way to generate new features for any tabular dataset containing numerical, categorical, and text features. It takes only three lines of code to automatically enrich features based on data analysis, statistics, clustering, and multi-feature interaction (see the sketch after this list).
* [LLM Data Preparation](RecDP/pyrecdp/LLM/README.md): Provides a parallel, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as jsonlines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near-dedup, ROUGE, exact), splitting, special-character fixing, several kinds of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT-3 quality, toxicity, perplexity, etc.). The tool also supports saving output as jsonlines or parquet, or inserting it into VectorStores (FaissStore, ChromaStore, ElasticSearchStore).

* [Smart Democratization Advisor (SDA)](e2eAIOK/SDA/README.md): A user-guided tool to facilitate automation of built-in model democratization via parameterized models. It generates YAML files based on user choices, provides built-in intelligence through parameterized models, and leverages SigOpt for HPO. SDA converts manual model tuning and optimization into assisted autoML and autoHPO, and provides a list of built-in optimized models spanning RecSys, CV, NLP, ASR, and RL.
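As referenced in the Auto Feature Engineering bullet above, a minimal sketch of the three-line flow (assuming RecDP's `FeatureWrangler` entry point; the dataset file and `label` column are placeholders):

```
import pandas as pd
from pyrecdp.autofe import FeatureWrangler  # entry point assumed from RecDP's tabular example

train_data = pd.read_csv("train.csv")  # placeholder dataset
pipeline = FeatureWrangler(dataset=train_data, label="target")  # "target" is a placeholder label column
transformed_train_df = pipeline.fit_transform()
```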

3 changes: 2 additions & 1 deletion RecDP/Dockerfile/DockerfileUbuntu
@@ -5,4 +5,5 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
RUN pip install --upgrade pip
RUN pip install pyspark
RUN pip install graphviz jupyterlab
RUN apt-get install -y tesseract-ocr
RUN apt-get update && apt-get install -y ffmpeg
27 changes: 10 additions & 17 deletions RecDP/README.md
@@ -2,14 +2,16 @@

We provide Intel-optimized solutions for

* [**Tabular**](pyrecdp/autofe/README.md) - Auto Feature Engineering Pipeline, 50+ essential primitives for feature engineering.
* [**LLM Text**](pyrecdp/LLM/README.md) - 10+ essential primitives for text clean, fixing, deduplication, 4 quality control module, 2 built-in high quality data pipelines.
* [**Auto Feature Engineering**](pyrecdp/autofe/README.md) - Provides an automatic way to generate new features for any tabular dataset containing numerical, categorical, and text features. It takes only three lines of code to automatically enrich features based on data analysis, statistics, clustering, and multi-feature interaction.
* [**LLM Data Preparation**](pyrecdp/LLM/README.md) - Provides a parallel, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as jsonlines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near-dedup, ROUGE, exact), splitting, special-character fixing, several kinds of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT-3 quality, toxicity, perplexity, etc.). The tool also supports saving output as jsonlines or parquet, or inserting it into VectorStores (FaissStore, ChromaStore, ElasticSearchStore). A minimal sketch follows below.
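A minimal sketch of the API, composed only from operations that appear elsewhere on this page (the input directory is a placeholder):

```
from pyrecdp.LLM import TextPipeline
from pyrecdp.primitives.operations import JsonlReader, LengthFilter, ProfanityFilter, ParquetWriter

# Read jsonl files, drop over-short and profane samples, write parquet.
pipeline = TextPipeline()
pipeline.add_operations([
    JsonlReader("data/"),  # placeholder input directory
    LengthFilter(),
    ProfanityFilter(),
    ParquetWriter("TextPipeline_output"),
])
ret = pipeline.execute()
```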

## Getting Started
## How it works

Install this tool through pip.

```
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp --pre
pip install pyrecdp[all] --pre
```

## RecDP - Tabular
@@ -35,26 +37,17 @@ transformed_train_df = pipeline.fit_transform()
* Low-code Fault-tolerant Auto-scaling Parallel Pipeline
![LLM Pipeline](resources/llm_pipeline.jpg)

Low code to build your own pipeline:
```
from pyrecdp.LLM import ResumableTextPipeline
pipeline = ResumableTextPipeline("usecase/finetune_pipeline.yaml")
ret = pipeline.execute()
```
or
```
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import ResumableTextPipeline
pipeline = ResumableTextPipeline()
urls = ["https://example.com/docs"]  # placeholder seed URLs for UrlLoader
ops = [
    UrlLoader(urls, max_depth=2),  # or JsonlReader("data/") to read local jsonl files instead
    DocumentSplit(),
    ProfanityFilter(),
    TextFix(),
    LanguageIdentify(),
    PIIRemoval(),
    # other built-in ops shown on this page include URLFilter() and LengthFilter()
    ...
    PerfileParquetWriter("ResumableTextPipeline_output")
]
pipeline.add_operations(ops)
```
@@ -67,4 +60,4 @@ pipeline.execute()
## Dependency
* Spark 3.4.*
* python 3.*
* Ray 2.7.*
173 changes: 173 additions & 0 deletions RecDP/examples/notebooks/llmutils/contraction_remove.ipynb
@@ -0,0 +1,173 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "z_gVsK2fahsZ",
"metadata": {
"id": "z_gVsK2fahsZ"
},
"source": [
"# RecDP LLM - Expanding Contractions\n",
"\n",
"\n",
"Contractions are shortened versions of words or syllables. They are created by removing, one or more letters from words. Sometimes, multiple words are combined to create a contraction. For example, \"I will is contracted into I’ll, do not into don’t.\" Considering I will and I’ll differently might result in poor performance of the model. Hence, it’s a good practice to convert each contraction into its expanded form. Recdp use the contractions library to convert contractions into their expanded form.\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "lFH8BqLubYLI",
"metadata": {
"id": "lFH8BqLubYLI"
},
"source": [
"# Get started"
]
},
{
"cell_type": "markdown",
"id": "n35FAQmcbdY_",
"metadata": {
"id": "n35FAQmcbdY_"
},
"source": [
"## 1. Install pyrecdp and dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wzlH_Ms3bnGM",
"metadata": {
"id": "wzlH_Ms3bnGM"
},
"outputs": [],
"source": [
"! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre\n",
"! pip install pyrecdp --pre\n",
"# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'"
]
},
{
"cell_type": "markdown",
"id": "LHPfbKs7be8l",
"metadata": {
"id": "LHPfbKs7be8l"
},
"source": [
"## 2. Prepare your own data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ED6Z8QPdbwoF",
"metadata": {
"id": "ED6Z8QPdbwoF"
},
"outputs": [],
"source": [
"%mkdir -p /content/test_data\n",
"%cd /content/test_data\n",
"!wget https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/github_sample_50.jsonl"
]
},
{
"cell_type": "markdown",
"id": "iIZVijQ7cG1N",
"metadata": {
"id": "iIZVijQ7cG1N"
},
"source": [
"## 3. Expanding Contractions"
]
},
{
"cell_type": "markdown",
"source": [
"### 3.1 Process with Expanding Contractions\n",
"\n"
],
"metadata": {
"id": "5EaDIIzQ0YiG"
},
"id": "5EaDIIzQ0YiG"
},
{
"cell_type": "code",
"execution_count": null,
"id": "736fb211-dbe6-4ca9-a1b1-db2cff2d287a",
"metadata": {
"id": "736fb211-dbe6-4ca9-a1b1-db2cff2d287a"
},
"outputs": [],
"source": [
"from pyrecdp.LLM import TextPipeline\n",
"from pyrecdp.primitives.operations import *\n",
"\n",
"input_path = \"/content/test_data/\"\n",
"output_path = \"TextPipeline_output\"\n",
"pipeline = TextPipeline()\n",
"ops = [\n",
" JsonlReader(input_path),\n",
" TextContractionRemove(),\n",
" ParquetWriter(output_path)\n",
"]\n",
"pipeline.add_operations(ops)\n",
"ret = pipeline.execute()\n",
"del pipeline\n"
]
},
{
"cell_type": "markdown",
"source": [
"### 3.2 View processed data"
],
"metadata": {
"id": "J5Lv3IZw0TNG"
},
"id": "J5Lv3IZw0TNG"
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"result_pd = pd.read_parquet(output_path)\n",
"result_pd.head()"
],
"metadata": {
"id": "5wlprXy00gBf"
},
"id": "5wlprXy00gBf",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
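The notebook above attributes contraction expansion to the contractions library. A minimal standalone sketch of the same transformation (assuming the open-source `contractions` package is installed; illustrative only):

```
import contractions  # pip install contractions (assumed dependency)

text = "I'll admit we don't know why it isn't working."
print(contractions.fix(text))
# -> "I will admit we do not know why it is not working."
```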