Commit 62b6d8d

Merge branch 'intel:main' into main

XinyaoWa authored Jan 31, 2024 · 2 parents 8e670e0 + bfc914a

Showing 208 changed files with 11,269 additions and 2,161 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/e2eaiok_recdp_release_pypi.yml
@@ -0,0 +1,37 @@
name: Publish RecDP Stable Release to PyPI

on:
  workflow_dispatch:
  push:
    branches:
      - main

permissions:
  contents: read
  packages: write

jobs:
  e2eaiok-release-python-pypi:
    runs-on: self-hosted
    if: ${{ github.repository_owner == 'intel' }}
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.7'

      - name: Build Package
        run: |
          pip install build wheel
          release_version=$(cat e2eAIOK/version | head -1)
          cd RecDP
          echo $release_version > version
          python3 setup.py sdist --with_prefix
      - name: Upload Package
        uses: pypa/gh-action-pypi-publish@master
        with:
          password: ${{ secrets.PYPI_API_TOKEN_E2EAIOKRECDP }}
          packages_dir: RecDP/dist
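For reference, the Build Package step simply copies the first line of `e2eAIOK/version` into `RecDP/version` and then runs a source build. A minimal Python sketch of the same logic for a local dry run (a hypothetical helper, not part of the workflow; assumes you are at the repository root):

```
# Local dry-run equivalent of the "Build Package" step above (hypothetical helper).
from pathlib import Path
import subprocess

# The first line of e2eAIOK/version is the release version.
release_version = Path("e2eAIOK/version").read_text().splitlines()[0]
Path("RecDP/version").write_text(release_version + "\n")

# Build the sdist the same way the workflow does.
subprocess.run(["python3", "setup.py", "sdist", "--with_prefix"], cwd="RecDP", check=True)
```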
4 changes: 3 additions & 1 deletion README.md
@@ -28,7 +28,9 @@ Intel® End-to-End AI Optimization Kit is a composable toolkit for E2E AI optim

## The key components are

* [RecDP](RecDP/README.md): A parallel data processing and feature engineering lib on top of Spark, and extensible to other data processing tools. It provides abstraction API to hide Spark programming complexity, delivers optimized performance through adaptive query plan & strategy, supports critical feature engineering functions on Tabular dataset, and can be easily integrated to third party solutions.
* [RecDP](RecDP/README.md): A one-stop toolkit for AI data processing. It provides LLM data processing and machine learning feature engineering libraries in a scalable fashion on top of Ray and Spark. It offers a simple-to-use API for data scientists, delivers optimized performance, and can be easily integrated into third-party solutions.
* [Auto Feature Engineering](RecDP/pyrecdp/autofe/README.md): Provides an automatic way to generate new features for any tabular dataset containing numerical, categorical, and text features. It takes only three lines of code to automatically enrich features based on data analysis, statistics, clustering, and multi-feature interaction (see the sketch after this list).
* [LLM Data Preparation](RecDP/pyrecdp/LLM/README.md): Provides a parallel, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as jsonlines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near-dedup, ROUGE, exact), splitting, special-character fixing, several kinds of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT-3 quality, toxicity, perplexity, etc.). The tool also supports saving output as jsonlines or parquet, or inserting it into VectorStores (FaissStore, ChromaStore, ElasticSearchStore).

* [Smart Democratization Advisor (SDA)](e2eAIOK/SDA/README.md): A user-guided tool to facilitate automation of built-in model democratization via parameterized models. It generates YAML files based on user choices, provides built-in intelligence through parameterized models, and leverages SigOpt for HPO. SDA converts manual model tuning and optimization into assisted autoML and autoHPO, and provides a list of built-in optimized models spanning RecSys, CV, NLP, ASR, and RL.
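As referenced in the Auto Feature Engineering bullet above, a minimal sketch of the three-line flow (assuming RecDP's `FeatureWrangler` entry point; the dataset file and `label` column are placeholders):

```
import pandas as pd
from pyrecdp.autofe import FeatureWrangler  # entry point assumed from RecDP's tabular example

train_data = pd.read_csv("train.csv")  # placeholder dataset
pipeline = FeatureWrangler(dataset=train_data, label="target")  # "target" is a placeholder label column
transformed_train_df = pipeline.fit_transform()
```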

3 changes: 2 additions & 1 deletion RecDP/Dockerfile/DockerfileUbuntu
@@ -5,4 +5,5 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
RUN pip install --upgrade pip
RUN pip install pyspark
RUN pip install graphviz jupyterlab
RUN apt-get install -y tesseract-ocr
RUN apt-get update && apt-get install -y ffmpeg
27 changes: 10 additions & 17 deletions RecDP/README.md
@@ -2,14 +2,16 @@

We provide Intel-optimized solutions for

* [**Tabular**](pyrecdp/autofe/README.md) - Auto Feature Engineering Pipeline, 50+ essential primitives for feature engineering.
* [**LLM Text**](pyrecdp/LLM/README.md) - 10+ essential primitives for text clean, fixing, deduplication, 4 quality control module, 2 built-in high quality data pipelines.
* [**Auto Feature Engineering**](pyrecdp/autofe/README.md) - Provides an automatic way to generate new features for any tabular dataset containing numerical, categorical, and text features. It takes only three lines of code to automatically enrich features based on data analysis, statistics, clustering, and multi-feature interaction.
* [**LLM Data Preparation**](pyrecdp/LLM/README.md) - Provides a parallel, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as jsonlines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near-dedup, ROUGE, exact), splitting, special-character fixing, several kinds of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT-3 quality, toxicity, perplexity, etc.). The tool also supports saving output as jsonlines or parquet, or inserting it into VectorStores (FaissStore, ChromaStore, ElasticSearchStore). A minimal sketch follows below.
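A minimal sketch of the API, composed only from operations that appear elsewhere on this page (the input directory is a placeholder):

```
from pyrecdp.LLM import TextPipeline
from pyrecdp.primitives.operations import JsonlReader, LengthFilter, ProfanityFilter, ParquetWriter

# Read jsonl files, drop over-short and profane samples, write parquet.
pipeline = TextPipeline()
pipeline.add_operations([
    JsonlReader("data/"),  # placeholder input directory
    LengthFilter(),
    ProfanityFilter(),
    ParquetWriter("TextPipeline_output"),
])
ret = pipeline.execute()
```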

## Getting Started
## How it works

Install this tool through pip.

```
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp --pre
pip install pyrecdp[all] --pre
```

## RecDP - Tabular
@@ -35,26 +37,17 @@ transformed_train_df = pipeline.fit_transform()
* Low-code Fault-tolerant Auto-scaling Parallel Pipeline
![LLM Pipeline](resources/llm_pipeline.jpg)

Low code to build your own pipeline:
```
from pyrecdp.LLM import ResumableTextPipeline
pipeline = ResumableTextPipeline("usecase/finetune_pipeline.yaml")
ret = pipeline.execute()
```
or
```
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import ResumableTextPipeline
pipeline = ResumableTextPipeline()
urls = ["https://example.com/docs"]  # placeholder seed URLs for UrlLoader
ops = [
    UrlLoader(urls, max_depth=2),  # or JsonlReader("data/") to read local jsonl files instead
    DocumentSplit(),
    ProfanityFilter(),
    TextFix(),
    LanguageIdentify(),
    PIIRemoval(),
    # other built-in ops shown on this page include URLFilter() and LengthFilter()
    ...
    PerfileParquetWriter("ResumableTextPipeline_output")
]
pipeline.add_operations(ops)
```
@@ -67,4 +60,4 @@ pipeline.execute()
## Dependency
* Spark 3.4.*
* python 3.*
* Ray 2.7.*
173 changes: 173 additions & 0 deletions RecDP/examples/notebooks/llmutils/contraction_remove.ipynb
@@ -0,0 +1,173 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "z_gVsK2fahsZ",
"metadata": {
"id": "z_gVsK2fahsZ"
},
"source": [
"# RecDP LLM - Expanding Contractions\n",
"\n",
"\n",
"Contractions are shortened versions of words or syllables. They are created by removing, one or more letters from words. Sometimes, multiple words are combined to create a contraction. For example, \"I will is contracted into I’ll, do not into don’t.\" Considering I will and I’ll differently might result in poor performance of the model. Hence, it’s a good practice to convert each contraction into its expanded form. Recdp use the contractions library to convert contractions into their expanded form.\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "lFH8BqLubYLI",
"metadata": {
"id": "lFH8BqLubYLI"
},
"source": [
"# Get started"
]
},
{
"cell_type": "markdown",
"id": "n35FAQmcbdY_",
"metadata": {
"id": "n35FAQmcbdY_"
},
"source": [
"## 1. Install pyrecdp and dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wzlH_Ms3bnGM",
"metadata": {
"id": "wzlH_Ms3bnGM"
},
"outputs": [],
"source": [
"! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre\n",
"! pip install pyrecdp --pre\n",
"# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'"
]
},
{
"cell_type": "markdown",
"id": "LHPfbKs7be8l",
"metadata": {
"id": "LHPfbKs7be8l"
},
"source": [
"## 2. Prepare your own data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ED6Z8QPdbwoF",
"metadata": {
"id": "ED6Z8QPdbwoF"
},
"outputs": [],
"source": [
"%mkdir -p /content/test_data\n",
"%cd /content/test_data\n",
"!wget https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/github_sample_50.jsonl"
]
},
{
"cell_type": "markdown",
"id": "iIZVijQ7cG1N",
"metadata": {
"id": "iIZVijQ7cG1N"
},
"source": [
"## 3. Expanding Contractions"
]
},
{
"cell_type": "markdown",
"source": [
"### 3.1 Process with Expanding Contractions\n",
"\n"
],
"metadata": {
"id": "5EaDIIzQ0YiG"
},
"id": "5EaDIIzQ0YiG"
},
{
"cell_type": "code",
"execution_count": null,
"id": "736fb211-dbe6-4ca9-a1b1-db2cff2d287a",
"metadata": {
"id": "736fb211-dbe6-4ca9-a1b1-db2cff2d287a"
},
"outputs": [],
"source": [
"from pyrecdp.LLM import TextPipeline\n",
"from pyrecdp.primitives.operations import *\n",
"\n",
"input_path = \"/content/test_data/\"\n",
"output_path = \"TextPipeline_output\"\n",
"pipeline = TextPipeline()\n",
"ops = [\n",
" JsonlReader(input_path),\n",
" TextContractionRemove(),\n",
" ParquetWriter(output_path)\n",
"]\n",
"pipeline.add_operations(ops)\n",
"ret = pipeline.execute()\n",
"del pipeline\n"
]
},
{
"cell_type": "markdown",
"source": [
"### 3.2 View processed data"
],
"metadata": {
"id": "J5Lv3IZw0TNG"
},
"id": "J5Lv3IZw0TNG"
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"result_pd = pd.read_parquet(output_path)\n",
"result_pd.head()"
],
"metadata": {
"id": "5wlprXy00gBf"
},
"id": "5wlprXy00gBf",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
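The notebook above attributes contraction expansion to the contractions library. A minimal standalone sketch of the same transformation (assuming the open-source `contractions` package is installed; illustrative only):

```
import contractions  # pip install contractions (assumed dependency)

text = "I'll admit we don't know why it isn't working."
print(contractions.fix(text))
# -> "I will admit we do not know why it is not working."
```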