forked from intel/e2eAIOK
Commit
Showing 208 changed files with 11,269 additions and 2,161 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
name: Publish RecDP Stable Release to PyPI | ||
|
||
on: | ||
workflow_dispatch: | ||
push: | ||
branches: | ||
- main | ||
|
||
permissions: | ||
contents: read | ||
packages: write | ||
|
||
jobs: | ||
e2eaiok-release-python-pypi: | ||
runs-on: self-hosted | ||
if: ${{ github.repository_owner == 'intel' }} | ||
steps: | ||
- uses: actions/checkout@v2 | ||
|
||
- name: Set up Python | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: '3.7' | ||
|
||
- name: Build Package | ||
run: | | ||
pip install build wheel | ||
release_version=$(cat e2eAIOK/version | head -1) | ||
cd RecDP | ||
echo $release_version > version | ||
python3 setup.py sdist --with_prefix | ||
- name: Upload Package | ||
uses: pypa/gh-action-pypi-publish@master | ||
with: | ||
password: ${{ secrets.PYPI_API_TOKEN_E2EAIOKRECDP }} | ||
packages_dir: RecDP/dist |
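
The "Build Package" step above takes the first line of `e2eAIOK/version` and writes it to `RecDP/version` before building the sdist. A minimal Python sketch of that version-propagation step follows, for readers who want to reproduce it locally. The helper name `propagate_version` and the placeholder version string are assumptions made for illustration; neither is part of the workflow or of RecDP.

```python
import tempfile
from pathlib import Path


def propagate_version(source: Path, target: Path) -> str:
    """Copy the first line of `source` into `target` and return it.

    Hypothetical helper mirroring `cat ... | head -1` followed by
    `echo ... > version` in the workflow's shell step.
    """
    with source.open() as handle:
        # Only the first line is used; surrounding whitespace is dropped.
        version = handle.readline().strip()
    target.write_text(version + "\n")
    return version


if __name__ == "__main__":
    # Build a throwaway directory layout that imitates the repository.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "e2eAIOK").mkdir()
        (root / "RecDP").mkdir()
        # "1.2.3" is a placeholder, not a real release number.
        (root / "e2eAIOK" / "version").write_text("1.2.3\nignored line\n")

        written = propagate_version(
            root / "e2eAIOK" / "version", root / "RecDP" / "version"
        )
        print(written)
```

Keeping a single source of truth for the version, as the workflow does, avoids the two packages drifting apart between releases.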
173 changes: 173 additions & 0 deletions
RecDP/examples/notebooks/llmutils/contraction_remove.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,173 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "z_gVsK2fahsZ", | ||
"metadata": { | ||
"id": "z_gVsK2fahsZ" | ||
}, | ||
"source": [ | ||
"# RecDP LLM - Expanding Contractions\n", | ||
"\n", | ||
"\n", | ||
"Contractions are shortened forms of words or phrases, created by removing one or more letters. Sometimes multiple words are combined into a single contraction: for example, \"I will\" is contracted into \"I’ll\" and \"do not\" into \"don’t\". Treating \"I will\" and \"I’ll\" as different tokens might result in poor model performance, so it is good practice to convert each contraction into its expanded form. RecDP uses the contractions library to convert contractions into their expanded form.\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "lFH8BqLubYLI", | ||
"metadata": { | ||
"id": "lFH8BqLubYLI" | ||
}, | ||
"source": [ | ||
"# Get started" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "n35FAQmcbdY_", | ||
"metadata": { | ||
"id": "n35FAQmcbdY_" | ||
}, | ||
"source": [ | ||
"## 1. Install pyrecdp and dependencies" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "wzlH_Ms3bnGM", | ||
"metadata": { | ||
"id": "wzlH_Ms3bnGM" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"! DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre\n", | ||
"! pip install pyrecdp --pre\n", | ||
"# ! pip install 'git+https://github.com/intel/e2eAIOK.git#egg=pyrecdp&subdirectory=RecDP'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "LHPfbKs7be8l", | ||
"metadata": { | ||
"id": "LHPfbKs7be8l" | ||
}, | ||
"source": [ | ||
"## 2. Prepare your own data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "ED6Z8QPdbwoF", | ||
"metadata": { | ||
"id": "ED6Z8QPdbwoF" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%mkdir -p /content/test_data\n", | ||
"%cd /content/test_data\n", | ||
"!wget https://raw.githubusercontent.com/intel/e2eAIOK/main/RecDP/tests/data/llm_data/github_sample_50.jsonl" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "iIZVijQ7cG1N", | ||
"metadata": { | ||
"id": "iIZVijQ7cG1N" | ||
}, | ||
"source": [ | ||
"## 3. Expanding Contractions" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"### 3.1 Process with Expanding Contractions\n", | ||
"\n" | ||
], | ||
"metadata": { | ||
"id": "5EaDIIzQ0YiG" | ||
}, | ||
"id": "5EaDIIzQ0YiG" | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "736fb211-dbe6-4ca9-a1b1-db2cff2d287a", | ||
"metadata": { | ||
"id": "736fb211-dbe6-4ca9-a1b1-db2cff2d287a" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from pyrecdp.LLM import TextPipeline\n", | ||
"from pyrecdp.primitives.operations import *\n", | ||
"\n", | ||
"input_path = \"/content/test_data/\"\n", | ||
"output_path = \"TextPipeline_output\"\n", | ||
"pipeline = TextPipeline()\n", | ||
"ops = [\n", | ||
" JsonlReader(input_path),\n", | ||
" TextContractionRemove(),\n", | ||
" ParquetWriter(output_path)\n", | ||
"]\n", | ||
"pipeline.add_operations(ops)\n", | ||
"ret = pipeline.execute()\n", | ||
"del pipeline\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"### 3.2 View processed data" | ||
], | ||
"metadata": { | ||
"id": "J5Lv3IZw0TNG" | ||
}, | ||
"id": "J5Lv3IZw0TNG" | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"import pandas as pd\n", | ||
"result_pd = pd.read_parquet(output_path)\n", | ||
"result_pd.head()" | ||
], | ||
"metadata": { | ||
"id": "5wlprXy00gBf" | ||
}, | ||
"id": "5wlprXy00gBf", | ||
"execution_count": null, | ||
"outputs": [] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"provenance": [] | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
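
The notebook's introduction describes contraction expansion as mapping forms such as "I'll" and "don't" back to "I will" and "do not". The sketch below illustrates that idea with the standard library only. It is not the implementation behind `TextContractionRemove` and not the contractions library: the `CONTRACTIONS` table and the `expand_contractions` function are hypothetical names, and the table is deliberately tiny.

```python
import re

# Hypothetical, deliberately small mapping for illustration only;
# a real contraction table is far larger.
CONTRACTIONS = {
    "i'll": "I will",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
}

# Match any known contraction as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction in `text` with its expanded form."""
    # Normalise the typographic apostrophe so both quote styles match.
    text = text.replace("\u2019", "'")

    def _replace(match: re.Match) -> str:
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Keep a leading capital, e.g. "Don't" becomes "Do not".
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


if __name__ == "__main__":
    print(expand_contractions("I'll go, but don't wait."))
```

A lookup table keeps the transformation predictable, but it cannot resolve ambiguous forms (for instance, "it's" may mean "it is" or "it has"), which is one reason to rely on a maintained library for real data.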