feat(clean): add updated version of rapidfuzz and python-crfsuite
add submodule of Levenshtein and python-crfsuite

try to compile Levenshtein with setup.py

failed poetry build

successfully connect poetry and cmake build

delete the python-Levenshtein package

add cython compiler

fix(clean) : add precompiling code for python-crfsuite

Polish the build.py

Connect the clean util function with the compiled file

Add clean_build pipeline file

Fix pip install whl

Update clean_build.yml

Update clean_build

feat(clean) : update build.py

feat(clean) : update clean_build.yml

feat(clean) : update clean_build.yml to release.yml

add all workflows
qidanrui authored and dovahcrow committed Jul 7, 2022
1 parent 7b2ce3f commit 59f3506
Showing 10 changed files with 1,488 additions and 882 deletions.
7 changes: 6 additions & 1 deletion .github/workflows/ci.yml
@@ -7,7 +7,7 @@ on:
- release
pull_request:
branches:
- develop
- develop

jobs:
build:
@@ -20,6 +20,8 @@ jobs:
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v2
with:
submodules: recursive

- uses: actions/setup-python@v2
with:
@@ -38,6 +40,9 @@ jobs:
poetry install
poetry config --list
- name: Build binary dependencies
run: poetry build

- name: Print tool versions
run: |
poetry run pylint --version
259 changes: 197 additions & 62 deletions .github/workflows/release.yml
@@ -6,43 +6,45 @@ on:
- release

jobs:
# docs-build:
# runs-on: ubuntu-latest
# steps:
# - uses: actions/checkout@v2

# - name: Install dependencies
# run: |
# pip install poetry
# curl -L https://github.com/jgm/pandoc/releases/download/2.11.2/pandoc-2.11.2-1-amd64.deb -o /tmp/pandoc.deb && sudo dpkg -i /tmp/pandoc.deb

# - name: Cache venv
# uses: actions/cache@v2
# with:
# path: ~/.cache/pypoetry/virtualenvs
# key: ${{ runner.os }}-build-${{ matrix.python }}-${{ secrets.CACHE_VERSION }}-${{ hashFiles('poetry.lock') }}

# - name: Install dependencies
# run: |
# pip install poetry
# poetry install

# - name: Build docs
# run: poetry run sphinx-build -M html docs/source docs/build

# - name: Archive docs
# uses: actions/upload-artifact@v2
# with:
# name: docs
# path: docs/build/html

build:
runs-on: ubuntu-latest
# needs: docs-build
# docs-build:
# runs-on: ubuntu-latest
# steps:
# - uses: actions/checkout@v2
# - name: Install dependencies
# run: |
# pip install poetry
# curl -L https://github.com/jgm/pandoc/releases/download/2.11.2/pandoc-2.11.2-1-amd64.deb -o /tmp/pandoc.deb && sudo dpkg -i /tmp/pandoc.deb

# - name: Cache venv
# uses: actions/cache@v2
# with:
# path: ~/.cache/pypoetry/virtualenvs
# key: ${{ runner.os }}-build-${{ matrix.python }}-${{ secrets.CACHE_VERSION }}-${{ hashFiles('poetry.lock') }}

# - name: Install dependencies
# run: |
# pip install poetry
# poetry install

# - name: Build docs
# run: poetry run sphinx-build -M html docs/source docs/build

# - name: Archive docs
# uses: actions/upload-artifact@v2
# with:
# name: docs
# path: docs/build/html
build-wheels-linux-macos:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest]
python: ["3.8", "3.9"]
steps:
- uses: actions/checkout@v2
with:
fetch-depth: "0"
submodules: recursive

- uses: actions/setup-python@v1
with:
@@ -61,15 +63,148 @@ jobs:
poetry install
poetry config --list
- name: Print tool versions
- name: Build wheels
run: poetry build

- name: Upload wheels
uses: actions/upload-artifact@v2
with:
name: "${{ matrix.os }}-${{ matrix.python}}"
path: dist/*.whl

build-wheels-windows:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [windows-latest]
python: ["3.8", "3.9"]
steps:
- uses: actions/checkout@v2
with:
submodules: recursive

- uses: actions/setup-python@v1
with:
python-version: ${{ matrix.python }}

- name: Cache venv
uses: actions/cache@v2
with:
path: ~/.cache/pypoetry/virtualenvs
key: ${{ runner.os }}-build-${{ matrix.python }}-${{ secrets.CACHE_VERSION }}-${{ hashFiles('poetry.lock') }}

- name: Install dependencies
run: |
poetry run pylint --version
poetry run pytest --version
poetry run black --version
poetry run pyright --version
echo "Cache Version ${{ secrets.CACHE_VERSION }}"
pip install poetry toml-cli
poetry install
poetry config --list
- name: Build wheels
run: poetry build

- name: Upload wheels
uses: actions/upload-artifact@v2
with:
name: "${{ matrix.os }}-${{ matrix.python}}"
path: dist/*.whl

build-sdist:
name: Build source distribution
runs-on: ${{ matrix.os }}
strategy:
matrix:
python: ["3.9"]
os: [ubuntu-latest]
steps:
- uses: actions/checkout@v2
with:
submodules: recursive

- uses: actions/setup-python@v1
with:
python-version: '3.9'

- name: Cache venv
uses: actions/cache@v2
with:
path: ~/.cache/pypoetry/virtualenvs
key: ${{ runner.os }}-build-${{ matrix.python }}-${{ secrets.CACHE_VERSION }}-${{ hashFiles('poetry.lock') }}

- name: Install dependencies
run: |
echo "Cache Version ${{ secrets.CACHE_VERSION }}"
pip install poetry toml-cli
poetry install
poetry config --list
- name: Build wheels
run: poetry build

- name: Upload sdist
uses: actions/upload-artifact@v2
with:
path: dist/*.tar.gz

verify:
runs-on: ${{ matrix.os }}
needs: [build-wheels-linux-macos, build-wheels-windows, build-sdist]
strategy:
matrix:
python: ["3.8", "3.9"]
os: [macos-latest, ubuntu-latest, windows-latest]
steps:
- uses: actions/checkout@v2
with:
submodules: recursive

- uses: actions/setup-python@v1
with:
python-version: ${{ matrix.python }}

- uses: actions/download-artifact@v3
with:
name: "${{ matrix.os }}-${{ matrix.python}}"

- run: |
pip install *.whl
python -c "import dataprep"
upload:
runs-on: ${{ matrix.os }}
# needs: docs-build
needs: [verify]
strategy:
matrix:
python: ["3.9"]
os: [ubuntu-latest]
steps:
- uses: actions/checkout@v2
with:
submodules: recursive

- uses: actions/setup-python@v1
with:
python-version: ${{ matrix.python }}

- name: Cache venv
uses: actions/cache@v2
with:
path: ~/.cache/pypoetry/virtualenvs
key: ${{ runner.os }}-build-${{ matrix.python }}-${{ secrets.CACHE_VERSION }}-${{ hashFiles('poetry.lock') }}

- name: Install dependencies
run: |
echo "Cache Version ${{ secrets.CACHE_VERSION }}"
pip install poetry toml-cli
poetry install
poetry config --list
- name: Download all artifacts
uses: actions/download-artifact@v3
with:
path: dist

- name: Parse version from pyproject.toml
run: echo "DATAPREP_VERSION=`toml get --toml-path pyproject.toml tool.poetry.version`" >> $GITHUB_ENV
@@ -79,7 +214,7 @@ jobs:

- uses: ncipollo/release-action@v1
with:
-artifacts: "dist/*.whl,dist/*.tar.gz"
+artifacts: "dist/**/*.whl,dist/**/*.tar.gz"
bodyFile: "RELEASE.md"
token: ${{ secrets.GITHUB_TOKEN }}
draft: true
@@ -88,25 +223,25 @@ jobs:

- name: Upload wheels
run: poetry publish --username __token__ --password ${{ secrets.PYPI_TOKEN }}

# docs-deploy:
# runs-on: ubuntu-latest
# needs: build
# if: ${{ github.event.ref == 'refs/heads/release' }}
# steps:
# - uses: actions/checkout@v2

# - name: Download docs
# uses: actions/download-artifact@v2
# with:
# name: docs
# path: docs/build/html

# - run: echo 'docs.dataprep.ai' > docs/build/html/CNAME

# - name: Deploy 🚀
# uses: JamesIves/[email protected]
# with:
# branch: gh-pages # The branch the action should deploy to.
# folder: docs/build/html # The folder the action should deploy.
# clean-exclude: dev
# docs-deploy:
# runs-on: ubuntu-latest
# needs: build
# if: ${{ github.event.ref == 'refs/heads/release' }}
# steps:
# - uses: actions/checkout@v2

# - name: Download docs
# uses: actions/download-artifact@v2
# with:
# name: docs
# path: docs/build/html

# - run: echo 'docs.dataprep.ai' > docs/build/html/CNAME

# - name: Deploy 🚀
# uses: JamesIves/[email protected]
# with:
# branch: gh-pages # The branch the action should deploy to.
# folder: docs/build/html # The folder the action should deploy.
# clean-exclude: dev
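
Note on the `artifacts` pattern in the `upload` job: when `actions/download-artifact` is called without a `name`, it places each artifact in its own subdirectory under `path`, which is presumably why the pattern moved from `dist/*.whl` to `dist/**/*.whl`. The sketch below only illustrates that layout with Python's `pathlib` globbing; the release action uses its own glob implementation, and the artifact and wheel file names here are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def make_artifact_layout(root: Path) -> Path:
    """Create dist/<artifact-name>/<wheel>, mimicking a download of all artifacts."""
    dist = root / "dist"
    for artifact_name, wheel_name in [
        # Hypothetical artifact and wheel names, for illustration only.
        ("ubuntu-latest-3.8", "pkg-0.0.0-cp38-linux.whl"),
        ("windows-latest-3.9", "pkg-0.0.0-cp39-win.whl"),
    ]:
        folder = dist / artifact_name
        folder.mkdir(parents=True)
        (folder / wheel_name).touch()
    return dist


def match(dist: Path, pattern: str) -> List[str]:
    """Return the sorted file names under `dist` that match `pattern`."""
    return sorted(p.name for p in dist.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    dist = make_artifact_layout(Path(tmp))
    flat = match(dist, "*.whl")       # only looks directly inside dist/
    nested = match(dist, "**/*.whl")  # also descends into per-artifact folders
    print("flat:", flat)
    print("nested:", nested)
```

With this layout the flat pattern finds nothing, while the recursive pattern finds both wheels.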
1 change: 1 addition & 0 deletions .gitmodules
@@ -0,0 +1 @@

1 change: 1 addition & 0 deletions dataprep/clean/__init__.py
@@ -2,6 +2,7 @@
dataprep.clean
==============
"""

from .clean_lat_long import clean_lat_long, validate_lat_long

from .clean_email import clean_email, validate_email
4 changes: 1 addition & 3 deletions dataprep/clean/address_utils.py
@@ -7,14 +7,12 @@
from builtins import str
from typing import Any, Dict, List, Optional, Tuple
from collections import OrderedDict

import os
import string
import re
import os
import warnings
import pycrfsuite


TAG_MAPPING = {
"OccupancyType": "apartment",
"OccupancyIdentifier": "apartment",
6 changes: 3 additions & 3 deletions dataprep/clean/clean_duplication_utils.py
@@ -11,16 +11,16 @@
from itertools import permutations
from os import path
from tempfile import mkdtemp

import pandas as pd
import dask.dataframe as dd
import dask
from IPython.display import Javascript, display
from Levenshtein import distance
from metaphone import doublemetaphone
from rapidfuzz.distance.Levenshtein import distance as LevenshteinDistance

from .utils import to_dask


DECODE_FUNC = """
function b64DecodeUnicode(str) {
// Going backwards: from bytestream, to percent-encoding, to original string.
@@ -141,7 +141,7 @@ def _get_nearest_neighbour_clusters(
continue

cluster_map[center].add(center)
-dist = distance(center, val)
+dist = LevenshteinDistance(center, val)
if dist <= radius or radius < 0:
cluster_map[center].add(val)
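
Note on the swap above: the hunk replaces `Levenshtein.distance` with `rapidfuzz.distance.Levenshtein.distance`, and both return the plain edit distance for two strings. The following is a minimal sketch of the radius check in isolation, not the actual dataprep function. `cluster_by_radius` is a hypothetical helper, and the pure-Python fallback is an assumption added only so the snippet runs when rapidfuzz is not installed.

```python
from typing import List, Set

try:
    # Same import the commit switches to (assumes rapidfuzz >= 2.0 is installed).
    from rapidfuzz.distance.Levenshtein import distance as LevenshteinDistance
except ImportError:

    def LevenshteinDistance(a: str, b: str) -> int:
        """Pure-Python fallback: classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ch_a in enumerate(a, start=1):
            cur = [i]
            for j, ch_b in enumerate(b, start=1):
                cur.append(
                    min(
                        prev[j] + 1,  # deletion
                        cur[j - 1] + 1,  # insertion
                        prev[j - 1] + (ch_a != ch_b),  # substitution
                    )
                )
            prev = cur
        return prev[-1]


def cluster_by_radius(center: str, values: List[str], radius: int) -> Set[str]:
    """Collect values within `radius` edits of `center` (hypothetical helper).

    Mirrors the condition in the hunk: a negative radius accepts every value.
    """
    cluster = {center}
    for val in values:
        if val == center:
            continue
        dist = LevenshteinDistance(center, val)
        if dist <= radius or radius < 0:
            cluster.add(val)
    return cluster


print(cluster_by_radius("apple", ["appel", "banana", "apple"], radius=2))
```

Because the two functions are called the same way for two string arguments, the call site only needs the new name.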

2 changes: 1 addition & 1 deletion dataprep/clean/gui/clean_gui.py
@@ -229,7 +229,7 @@
"clean_json",
"clean_address",
"clean_df",
-# "clean_duplication",
+"clean_duplication",
"clean_currency",
"clean_au_abn",
"clean_au_acn",
5 changes: 2 additions & 3 deletions dataprep/clean/utils.py
@@ -1,13 +1,12 @@
"""Common functions"""
from typing import Dict, Union, Any
import http.client
import json
from math import ceil
from typing import Any, Dict, Union

import dask.dataframe as dd
import numpy as np
import pandas as pd
from math import ceil


NULL_VALUES = {
np.nan,