Publish to PyPI (#5)
* Prepare to publish to pypi

* Resolve comments

* More updates to rules

* Refactor for the cli tool

* Restructure for package release

* Add 㩒

* Add 'label' output type

* Update README

* Add tests

* Add github action for publishing

* Publish to pypi

* v1.0.1

* Bump version
laubonghaudoi authored Dec 9, 2023
1 parent cd9d8b7 commit 2c56043
Showing 9 changed files with 282 additions and 45 deletions.
29 changes: 29 additions & 0 deletions .github/workflows/publish-to-pypi.yml
@@ -0,0 +1,29 @@
name: Publish Python Package to PyPI

on:
  release:
    types: [published]

jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install setuptools wheel twine
      - name: Run tests
        run: |
          python -m unittest tests/test_judge.py
      - name: Build and publish
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
        run: |
          python setup.py sdist bdist_wheel
          twine upload dist/*
130 changes: 118 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
# 粵文分類器
# 粵文分類篩選器

[![license](https://img.shields.io/github/license/DAVFoundation/captain-n3m0.svg?style=flat-square)](https://github.com/DAVFoundation/captain-n3m0/blob/master/LICENSE)

[English](https://github.com/CanCLID/cantonese-classifier#cantonese-text-classifier)

呢個係個粵文分類器,用嚟區分粵語同官話文本,對於篩選粵語語料好有用。個分類器會將輸入文本分成四類:
## 簡介

呢個係個粵文篩選器,用嚟區分粵語同官話文本,對於篩選粵語語料好有用。個分類器會將輸入文本分成四類:

1. `cantonese`: 純粵文,僅含有粵語特徵字詞,例如“你喺邊度”
1. `mandarin`: 純官話文,僅含有官話特徵字詞,例如“你在哪裏”
@@ -13,37 +15,141 @@

分類方法係官話同粵語嘅特徵字詞識別。如果同時含有官話同粵語特徵詞彙就算官粵混雜,如果唔含有任何特徵,就算冇特徵中性文本。

本篩選器嘅主要設計目標係「篩選出可以用作訓練數據嘅優質粵文」,而非「準確分類輸入文本」。所以喺判斷粵語/官話嗰陣會用偏嚴格嘅判別標準,即係會犧牲 recall 嚟換取高 precision (寧願篩漏粵文句子都唔好將官話文誤判成粵文)。

注意:呢隻分類器**默認所有輸入文本都係傳統漢字**。如果要分類簡化字文本,要將佢哋轉化成傳統漢字先。推薦使用 [OpenCC](https://github.com/BYVoid/OpenCC)嚟轉換。

## 用法

首先要有一個輸入文檔,例如`input.txt`,入面每一行係一個句子,然後運行下面命令
首先用 pip 安裝

```bash
pip install canto-filter
```

你可以喺 Python 代碼入面用,亦都可以直接喺命令行入面用。

### Python 函數用法

本篩選器剩得一個函數 `judge()`,輸入一句話輸出佢嘅語言分類:

```python
from cantofilter import judge

print(judge('你喺邊度')) # cantonese
print(judge('你在哪裏')) # mandarin
print(judge('是咁的')) # mixed
print(judge('去學校讀書')) # neutral
```

### 命令行用法

首先要有一個輸入文檔,例如 `input.txt`,入面每行一個句子。

#### 輸出標籤同原文

然後運行下面命令

```bash
cantofilter --input input.txt > output.txt
```

噉樣會得到一個 `output.txt`,入面有由 \t 分成嘅兩列,第一列係判斷標籤,第二列係句子原文本。

#### 僅輸出一類

如果你想直接篩選出某一類嘅文本,噉可以加一個 `--type <LABEL>` 參數喺後面,例如

```bash
python3 main.py --input <INPUT.TXT>
cantofilter --input input.txt --type cantonese > output.txt
```

輸出係一個 `output.tsv`,入面有分成兩列,第一列係判斷標籤,第二列係句子原文本。
噉樣輸出嘅 `output.txt` 就會係純粵文句子。如果想剩係要官話、官粵混合或者中性文本,將個 `--type` 參數定成 `mandarin`、`mixed` 或者 `neutral` 就得。

#### 僅輸出標籤

你亦都可以剩係輸出啲句子嘅分類結果,用 `--type label` 就得:

```bash
cantofilter --input input.txt --type label > output.txt
```

噉樣嘅 `output.txt` 剩得一列,全部都係分類標籤。

## 依賴

# Cantonese text classifier
Python >= 3.6

This is a text classifier for Cantonese, a very useful tool for filtering Cantonese text corpus. It classifies input sentences with four output labels:
# Cantonese text filter

This is a text filter for Cantonese, designed for filtering Cantonese text corpora. It classifies each input sentence with one of four output labels:

1. `cantonese`: Pure Cantonese text, contains only Cantonese feature words. E.g. 你喺邊度
1. `mandarin`: Pure Mandarin text, contains only Mandarin feature words. E.g. 你在哪裏
1. `mixed`: Mixed Cantonese-Mandarin text, contains both Cantonese and Mandarin feature words. E.g. 是咁的
1. `neutral`: Featureless Chinese text, contains neither Cantonese nor Mandarin feature words. Such sentences can be used in both Cantonese and Mandarin text corpora. E.g. 去學校讀書

The classifier is rule-based, by detecting Mandarin and Cantonese feature characters and words. If a sentence contains both Cantonese and Mandarin feature words, then it is a mixed-Cantonese-Mandarin sentence. If it contains neither features, it is a no-feature, neutral Chinese text.
The filter is rule-based: it uses regular expressions to detect Mandarin and Cantonese feature characters and words. If a sentence contains both Cantonese and Mandarin feature words, it is labelled as mixed Cantonese-Mandarin. If it contains neither kind of feature, it is labelled as featureless, neutral Chinese text.
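
The labelling rule described above can be sketched in a few lines. This is a minimal illustration only, not the package's implementation: `toy_judge`, `CANTO_TOY` and `MANDO_TOY` are hypothetical names, and the two patterns are tiny stand-ins for the much larger rule set in `cantofilter/judge.py`, which also treats Mandarin-looking characters inside loan words specially.

```python
import re

# Toy feature patterns, assumed for illustration; the real rules are far larger.
CANTO_TOY = re.compile(r'[嘅咗喺咁冇]|邊度')
MANDO_TOY = re.compile(r'[這哪們是的在]')


def toy_judge(sentence: str) -> str:
    """Label a sentence by which of the two toy feature sets it matches."""
    has_canto = bool(CANTO_TOY.search(sentence))
    has_mando = bool(MANDO_TOY.search(sentence))
    if has_canto and has_mando:
        return "mixed"
    if has_canto:
        return "cantonese"
    if has_mando:
        return "mandarin"
    return "neutral"


# The four example sentences from the label list above.
for s in ["你喺邊度", "你在哪裏", "是咁的", "去學校讀書"]:
    print(toy_judge(s), s)
```

With these toy patterns the four README examples fall into the four labels, but real text needs the full rule set, so use `judge()` from the package for actual filtering.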

Notice: This classifier **assumes all input text to be written in Traditional Chinese characters**. If you want to classified texts written in simplified characters, please convert them into Traditional characters first. We recommend using [OpenCC](https://github.com/BYVoid/OpenCC) to do the conversion.
Note: This filter **assumes all input text is written in Traditional Chinese characters**. If you want to filter text written in Simplified characters, please convert it into Traditional characters first. We recommend using [OpenCC](https://github.com/BYVoid/OpenCC) to do the conversion.

## How to use

Prepare an input text file, e.g. `input.txt` where each line is a sentence. Then run
First, install the package with pip:

```bash
pip install canto-filter
```

This package can be used in Python code or as a CLI tool.

### Python function usage

There is only one function in this package, `judge()`, which accepts a string input and outputs one of the labels:

```python
from cantofilter import judge

print(judge('你喺邊度')) # cantonese
print(judge('你在哪裏')) # mandarin
print(judge('是咁的')) # mixed
print(judge('去學校讀書')) # neutral
```

### CLI usage

Prepare an input text file, e.g. `input.txt`, where each line is one sentence.

#### Output both labels and original texts

Then run

```bash
python3 main.py --input <INPUT.TXT>
cantofilter --input input.txt > output.txt
```

There will be a `output.tsv` which has two columns. The first column is the classification label, and the second column is the original input text.
This produces an `output.txt` with two tab-separated columns. The first column is the language label, and the second column is the original input text.
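
If you want to post-process that two-column output in Python, a minimal sketch could look like the following. The helper name `read_labelled` and the sample rows are assumptions made for illustration (the rows reuse the example sentences from this README); only the label-tab-sentence layout comes from the description above.

```python
import io


def read_labelled(stream):
    """Parse lines of the form '<label>\t<sentence>' into (label, sentence) pairs."""
    pairs = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so tabs inside the sentence are preserved.
        label, _, text = line.partition("\t")
        pairs.append((label, text))
    return pairs


# Stand-in for open("output.txt", encoding="utf-8"); the contents are made up.
sample_output = io.StringIO("cantonese\t你喺邊度\nmandarin\t你在哪裏\nneutral\t去學校讀書\n")

rows = read_labelled(sample_output)
cantonese_only = [text for label, text in rows if label == "cantonese"]
print(cantonese_only)
```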

#### Output only text of one class

If you want only one type of text, use the `--type <LABEL>` argument. Say if you want pure Cantonese text only:

```bash
cantofilter --input input.txt --type cantonese > output.txt
```

The `output.txt` will contain only Cantonese text.

#### Output label only

If you want only the classification labels, use `--type label` like this:

```bash
cantofilter --input input.txt --type label > output.txt
```

Your `output.txt` will then contain only the classification labels of the input sentences, one per line.
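
A label-only file is convenient for getting a quick picture of a corpus. As a rough sketch, assuming one label per line as described above, the labels can be tallied with the standard library; `count_labels` and the sample data are hypothetical and not part of the package.

```python
import io
from collections import Counter


def count_labels(stream) -> Counter:
    """Count how often each label appears, ignoring blank lines."""
    return Counter(line.strip() for line in stream if line.strip())


# Stand-in for open("output.txt", encoding="utf-8"); the contents are made up.
sample_labels = io.StringIO("cantonese\nmandarin\ncantonese\nneutral\n")

counts = count_labels(sample_labels)
for label, n in counts.most_common():
    print(f"{label}\t{n}")
```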

## Requirements

Python >= 3.6
2 changes: 2 additions & 0 deletions cantofilter/__init__.py
@@ -0,0 +1,2 @@
from .judge import judge
from .version import __version__
30 changes: 30 additions & 0 deletions cantofilter/cli.py
@@ -0,0 +1,30 @@
import argparse
import sys
from .judge import judge

def main():
    '''
    When used as a command line tool, specify input text file with `--input <INPUT.txt>`, and output type with `--type <TYPE>`.
    '''
    argparser = argparse.ArgumentParser(
        description='Specify input text file with `--input <INPUT.txt>`, where each line is a sentence. ')

    argparser.add_argument('--input', type=str, default='input.txt', help='Specify input text file, where each line is a sentence. Default is `input.txt`.')
    argparser.add_argument('--type', type=str, default='all', help='Specify the type of output. `all` for all sentences with a class label prepended, `label` for class labels only, `cantonese` for Cantonese sentences, `mandarin` for Mandarin sentences, `mixed` for mixed Mandarin-Cantonese sentences, `neutral` for neutral sentences. Default is `all`.')

    args = argparser.parse_args()

    with open(args.input, encoding='utf-8') as f:
        for line in f:
            l = line.strip()
            judgement: str = judge(l)
            if args.type == 'all':
                # Label and sentence, separated by a tab
                sys.stdout.write(f'{judgement}\t{l}\n')
            elif args.type == 'label':
                # Label only
                sys.stdout.write(f'{judgement}\n')
            elif args.type == judgement:
                # Only sentences whose label matches the requested type
                sys.stdout.write(f'{l}\n')


if __name__ == '__main__':
    main()
68 changes: 35 additions & 33 deletions main.py → cantofilter/judge.py
@@ -1,41 +1,51 @@
import argparse
import re
from typing import List, Tuple

canto_unique = re.compile(
r'[嘅嗰啲咗佢喺咁噉冇啩哋畀嚟諗惗乜嘢閪撚𨳍瞓睇㗎餸𨋢摷喎嚿噃嚡嘥嗮啱揾搵喐逳噏𢳂岋糴揈撳𥄫攰癐冚孻冧𡃁嚫跣𨃩瀡氹嬲]|' +
r'[嘅嗰啲咗佢喺咁噉冇啩哋畀嚟諗惗乜嘢閪撚𨳍𨳊瞓睇㗎餸𨋢摷喎嚿噃嚡嘥嗮啱揾搵喐逳噏𢳂岋糴揈捹撳㩒𥄫攰癐冚孻冧𡃁嚫跣𨃩瀡氹嬲掟孭]|' +
r'唔[係得會好識使洗駛通知到去走掂該]|點[樣會做得解]|[琴尋噚聽第]日|[而依]家|家[下陣]|[真就]係|邊[度個位科]|' +
r'[嚇凍冷攝整揩逢淥浸激][親嚫]|[橫搞打傾諗攞通得唔拆]掂|仲[有係話要得好衰唔]|' +
r'屋企|收皮')
r'[嚇凍攝整揩逢淥浸激][親嚫]|[橫搞傾諗得唔]掂|仲[有係話要得好衰唔]|返[學工去歸]|' +
r'屋企|收皮|傾[偈計]|幫襯|執[好生實返輸]|求其|是[但旦]|[濕溼]碎|零舍|肉[赤緊]')
mando_unique = re.compile(r'[這哪您們唄咱啥甭]|還[是好有]')
mando_feature = re.compile(r'[那是的他她吧沒不在麼么些了卻説說吃]|而已')
mando_feature = re.compile(r'[那是的他她吧沒在麼么些了卻説說吃弄]|而已')
mando_loan = re.compile(r'亞利桑那|剎那|巴塞羅那|薩那|沙那|哈瓦那|印第安那|那不勒斯|支那|' +
r'是日|是次|是非|利是|唯命是從|頭頭是道|似是而非|自以為是|俯拾皆是|撩是鬥非|莫衷一是|是但|是旦|大吉利是|' +
r'[目綠藍紅]的|的[士確]|波羅的海|眾矢之的|的而且確|' +
r'是[否日次非但旦]|利是|唯命是從|頭頭是道|似是而非|自以為是|俯拾皆是|撩是鬥非|莫衷一是|' +
r'[目綠藍紅]的|的[士確式]|波羅的海|眾矢之的|的而且確|' +
r'些[微少許小]|' +
r'[淹沉覆湮埋沒]沒|沒[落收]|神出鬼沒|' +
r'了[結無斷當然哥結得]|[未明]了|不了了之|不得了|大不了|' +
r'不[過滿如妨俗宜必死利當足絕一斷良同僅忠妙果]|迫不及待|意想不到|不外乎|風馬牛不相及|' +
r'[淹沉浸覆湮埋沒出]沒|沒[落收]|神出鬼沒|' +
r'了[結無斷當然哥結得解]|[未明]了|不了了之|不得了|大不了|' +
r'他[信人國日殺鄉]|[其利無排維]他|馬耳他|他加祿|他山之石|' +
r'在[場世讀於位編此]|[實存旨志好所自潛]在|無處不在|大有人在|' +
r'[酒網水貼]吧|吧台|' +
r'[退忘阻]卻|卻步|' +
r'[遊游小傳解學假淺眾衆][説說]|[說說][話服明]|自圓其[説說]|長話短[說説]|不由分[說説]' +
r'吃虧')
r'[遊游小傳解學假淺眾衆][説說]|[說說][話服明]|自圓其[説說]|長話短[說説]|不由分[說説]|' +
r'吃[虧苦]|' +
r'弄[堂]')


def is_within_loan_span(feature_span: Tuple[int, int], loan_spans: List[Tuple[int, int]]) -> bool:
    # 判斷一個官話特徵係唔係借詞。如果佢嘅位置喺某個借詞區間,就係借詞
    # Judge whether a Mandarin feature is a loan word. If its position is within a loan span, it is a loan.
    '''
    判斷一個官話特徵係唔係借詞。如果佢嘅位置喺某個借詞區間,就係借詞
    Judge whether a Mandarin feature is a loan word. If its position is within a loan span, it is a loan.
    Args:
        feature_span (Tuple[int, int]): 官話特徵嘅位置 Mandarin feature position
        loan_spans (List[Tuple[int, int]]): 借詞嘅位置 Loan word positions
    Returns:
        bool: 係唔係官話借詞 Whether the input feature is a Mandarin loan word
    '''

    for loan_span in loan_spans:
        if feature_span[0] >= loan_span[0] and feature_span[1] <= loan_span[1]:
            return True
    return False


def is_all_loan(s: str) -> bool:
    # 判斷一句話入面所有官話特徵係唔係都係借詞
    # Judge whether all Mandarin features in a sentence are loan words.
    '''
    判斷一句話入面所有官話特徵係唔係都係借詞
    Judge whether all Mandarin features in a sentence are loan words.
    '''
    mando_features = mando_feature.finditer(s)
    mando_loans = mando_loan.finditer(s)
    feature_spans = [m.span() for m in mando_features]
@@ -50,6 +60,15 @@ def is_all_loan(s: str) -> bool:


def judge(s: str) -> str:
    '''
    判斷一句話係粵語、官話、官話溝粵語定係中性
    Judge whether a sentence is Cantonese, Mandarin, mixed-Mandarin-Cantonese, or neutral.
    Args:
        s (str): 一句話 A sentence
    Returns:
        str: 粵語、官話、官話溝粵語定係中性 `cantonese`, `mandarin`, `mixed`, or `neutral`.
    '''
    has_canto_unique = bool(re.search(canto_unique, s))
    has_mando_unique = bool(re.search(mando_unique, s))
    has_mando_feature = bool(re.search(mando_feature, s))
@@ -96,20 +115,3 @@ def judge(s: str) -> str:
# 冇任何特徵,既可以當粵語亦可以當官話
# No features, can be either Cantonese or Mandarin
return "neutral"


if __name__ == '__main__':
    argparser = argparse.ArgumentParser(
        description='Specify input text file with `--input <INPUT.txt>`, where each line is a sentence. ')
    argparser.add_argument('--input', type=str, default='input.txt')
    args = argparser.parse_args()

    output = open('output.tsv', 'w', encoding="utf-8")

    with open(args.input, encoding='utf-8') as f:
        for line in f:
            l = line.strip()
            judgement = judge(l)
            output.write('{}\t{}\n'.format(judgement, l))

    output.close()
1 change: 1 addition & 0 deletions cantofilter/version.py
@@ -0,0 +1 @@
__version__ = "1.0.1"
38 changes: 38 additions & 0 deletions setup.py
@@ -0,0 +1,38 @@
from setuptools import setup, find_packages
from cantofilter import __version__

# Read the contents of your README file
with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setup(
    name="canto-filter",
    version=__version__,
    author="CanCLID (Cantonese Computational Linguistics Infrastructure Development Workgroup)",
    author_email="[email protected]",
    description="粵文分類篩選器 Cantonese text filter",
    long_description=long_description,
    long_description_content_type="text/markdown",
    packages=find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
        "Programming Language :: Python :: 3.12",
        "License :: OSI Approved :: MIT License",
        "Intended Audience :: Developers",
        "Topic :: Text Processing :: Linguistic",
        "Natural Language :: Cantonese",
        "Operating System :: OS Independent",
    ],
    python_requires=">=3.6",
    entry_points={
        "console_scripts": [
            "cantofilter=cantofilter.cli:main",  # 'command=package.module:function'
        ],
    },
)
Empty file added tests/__init__.py
Empty file.