Bicleaner Hardrules 2.10.6:
- Reduced the maximum number of glued words permitted from 3 to 2.
- Relaxed the bad encoding rule for more languages where "â" can occur (Romanian and Ligurian).
Bicleaner Hardrules 2.10.5:
- Bump FastSpell version.
Bicleaner Hardrules 2.10.4:
- Bumped HF hub version.
Bicleaner Hardrules 2.10.3:
- Bumped FastSpell version
Bicleaner Hardrules 2.10.2:
- Fix 2.10.1 build.
Bicleaner Hardrules 2.10.1:
- Fix division by 0 error on empty sentences.
- Fix rules that were giving false positives on empty sentences (no_titles, wrong_language)
- For performance, long sentences (> 10000 characters) are ignored by default; only "not_too_long" is output for them.
- Added "--dont_ignore_long" to override this behaviour
Bicleaner Hardrules 2.10.0:
- Update FastSpell to 0.10, with support for more languages.
Bicleaner Hardrules 2.9.1:
- Fix hardrules crash without metadata.
Bicleaner Hardrules 2.9.0:
- Accept HF identifiers in --metadata argument.
Bicleaner Hardrules 2.8.1:
- Fix no_url regex
- Fix builds with pip >= 23 using fasttext-wheel.
Bicleaner Hardrules 2.8.0:
- Update KenLM installation instructions
- Update FastSpell to 0.8
- Dictionaries installed as a dependency.
- Better coverage for Icelandic.
Bicleaner Hardrules 2.7.0:
- Relax unicode noise rule for Icelandic and Finnish.
Bicleaner Hardrules 2.6.0:
- Update FastSpell to 0.5: some improvements for Slovene and Serbo-Croatian language detection.
Bicleaner Hardrules 2.5.1:
- Fix installation instructions.
- Freeze some dependencies.
Bicleaner Hardrules 2.5:
- Disable no_urls by default.
Bicleaner Hardrules 2.4:
- Update FastSpell
- Fix FastSpell imports.
- Improved no_paren rule.
- Extended no_escaped_unicode rule.
- More aggressive url filtering.
Bicleaner Hardrules 2.3:
- Automated KenLM build.
- Check length ratio with characters in non-CJK.
Bicleaner Hardrules 2.2:
- Refinement of minimum length and repeated words for CJK.
- Filter sentences with inconsistencies in numbers (disabled by default)
- Filter sentences with characters in different scripts/writing systems (disabled by default)
Bicleaner Hard-rules 2.0:
- Parametrized hardrules: now each rule can be enabled or disabled via YAML config file.
- Run all mode: run all rules instead of stopping at the first discard.
- New hardrule: discard sentences that contain repeated words
- Avoid downloading multiple fasttext models in parallel on first run.
Bicleaner Hard-rules 1.3.1:
- Fix PyPi release.
Bicleaner Hard-rules 1.3:
- Filter bad encoding issues with à and Â
- Replace the identical rules with a single rule: identical without non-alpha characters.
- Return exit code 1 when a process encounters an error.
- Tag as wrong the sentence pairs with an empty side.
- Language identifier is now FastSpell
- Tag as wrong the sentence pairs with wrong number of columns.
Bicleaner Hard-rules 1.2:
- Add --score_only mode.
Bicleaner Hard-rules 1.1:
- Separate wrong_tu code.
- Load lm only when necessary.
Bicleaner Hard-rules 1.0:
- Split Hardrules into a separate package.
Bicleaner 0.14:
- Bicleaner hardrules changes:
  - New rule: filter out sentences containing gluedWordsLikeThis.
  - Rule change: Relaxed c_different_language rule for similar languages.
  - New rule: filter out porn sentences using FastText classifier.
  - Parameters changed: -s/--source_lang and -t/--target_lang are no longer mandatory (if a metadata .yaml file is provided)
- Other:
  - Now using sacremoses instead of mosestokenizer
Bicleaner 0.13:
- Bicleaner hardrules changes:
  - Rule change: Relaxed c_minimal_length to accept 3-word sentences
  - New feature: LM filtering (moved from Bicleaner Classify)
  - New parameters: --disable_lm_filter, --metadata and --lm_threshold, to support LM filtering
- Other:
  - Updated requirements
Bicleaner 0.12:
- Bicleaner hardrules changes:
  - New rule: c_identical_wo_punct to reject sentences that differ only in punctuation (case insensitive)
  - New rule: Sentences containing "Re:" are rejected
  - Rule change: c_minimal_length now rejects sentences with both sides <= 3 words (instead of only one)
  - Rule change: c_identical and c_identical_wo_digits are now case insensitive
  - Rule change: Breadcrumbs rule now split into c_no_breadcrumbs1 and c_no_breadcrumbs2
  - Rule change: Breadcrumbs2 now includes character "·" in the rejected characters
  - Rule change: c_length now compares byte length ratio (will avoid rejecting valid sentences due to length ratio when comparing languages with different alphabets)
  - Changed behaviour for --annotated_output argument in hardrules. See README.md for more information.
  - New parameter: --disable_lang_ident flag to avoid applying rules that need to identify the language
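
The byte length comparison noted for c_length in Bicleaner 0.12 can be illustrated with a minimal sketch. This is not the package's code: the function names and the min_ratio threshold are hypothetical, chosen only to show why measuring UTF-8 bytes instead of characters can keep valid pairs written in different alphabets.

```python
def byte_length_ratio(left: str, right: str) -> float:
    """Ratio of the shorter side to the longer side, measured in UTF-8 bytes."""
    left_bytes = len(left.encode("utf-8"))
    right_bytes = len(right.encode("utf-8"))
    longest = max(left_bytes, right_bytes)
    if longest == 0:
        return 0.0  # both sides empty: avoid dividing by zero
    return min(left_bytes, right_bytes) / longest


def passes_length_check(left: str, right: str, min_ratio: float = 0.5) -> bool:
    # min_ratio is a hypothetical threshold for this example,
    # not the value used by Bicleaner Hardrules.
    return byte_length_ratio(left, right) >= min_ratio


# "Hello" has 5 characters and "你好" has 2, so a character ratio would be 0.4.
# In UTF-8 the sides are 5 and 6 bytes, so the byte ratio is much closer to 1.
print(passes_length_check("Hello", "你好"))
```

With a character-based ratio and the same hypothetical threshold, this pair would have been rejected even though it is a plausible translation.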