A flexible sentence segmentation library using a CRF model and regex rules
This library allows splitting of text paragraphs into sentences. It is built with the following desiderata:
- Be able to extend to new languages or "types" of sentences from data alone by learning a conditional random field (CRF) model.
- Also provide functionality to segment (or not to segment) lines based on regular expression rules (referred to as segment_regexes and prevent_regexes, respectively).
- Be able to reconstruct the exact original text paragraphs by joining the segmented sentences.
All in all, the library aims to benefit from the best of both worlds: data-driven and rule-based approaches.
You can try out the library here.
Supports Python 3.7+
# stable
pip install sentsplit
# bleeding-edge
pip install git+https://github.com/zaemyung/sentsplit
Uses python-crfsuite, which, in turn, is built upon CRFsuite.
$ sentsplit segment -l lang_code -i /path/to/input_file # outputs to /path/to/input_file.segment
$ sentsplit segment -l lang_code -i /path/to/input_file -o /path/to/output_file
$ sentsplit segment -h # prints out the detailed usage
from sentsplit.segment import SentSplit
# use default setting
sent_splitter = SentSplit(lang_code)
# override default setting - see "Features" for detail
sent_splitter = SentSplit(lang_code, **overriding_kwargs)
# segment a single line
sentences = sent_splitter.segment(line)
# can also segment a list of lines (here, lines is a list of strings)
sentences = sent_splitter.segment(lines)
The behavior of segmentation can be adjusted by the following arguments:
- mincut: a line is not segmented if its character-level length is smaller than mincut, preventing sentences that are too short.
- maxcut: a line is "heuristically" segmented if its character-level length is greater than or equal to maxcut, preventing sentences that are too long.
- strip_spaces: trim any whitespace at the start and end of a sentence; does not guarantee exact reconstruction of the original passages.
- handle_multiple_spaces: substitute multiple spaces with a single space, perform segmentation, and recover the original spaces.
- segment_regexes: segment at either the start or end index of the matched group defined by the regex patterns.
- prevent_regexes: a line is not segmented at characters that fall within the matching group(s) captured by the regex patterns.
- prevent_word_split: a line is not segmented at characters that are within a word, where the word boundary is denoted by surrounding whitespace or punctuation; may not be suitable for languages (e.g. Chinese, Japanese, Thai) that do not use spaces to differentiate words.
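The idea behind handle_multiple_spaces can be sketched with the standard library alone. This is an illustration of the collapse, segment, restore sequence described above, not the library's actual implementation; segment_keeping_spaces and toy_segment are hypothetical names, and the sketch assumes the segmenter returns pieces that join back to its input exactly.

```python
import re

def segment_keeping_spaces(line, segment_fn):
    # Remember every run of spaces in order of appearance.
    runs = re.findall(r' +', line)
    # Collapse each run to a single space before segmenting.
    collapsed = re.sub(r' +', ' ', line)
    sentences = segment_fn(collapsed)
    # Put the original runs back, one per space, from left to right.
    restore = iter(runs)
    return [re.sub(r' ', lambda _m: next(restore), s) for s in sentences]

def toy_segment(text):
    # Hypothetical stand-in segmenter: cut after every period.
    return re.findall(r'[^.]+\.?', text)

line = 'Hello   world.  Bye  now.'
sentences = segment_keeping_spaces(line, toy_segment)
print(sentences)
```

Because every collapsed space maps back to exactly one original run, joining the returned sentences yields the input line again.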
Segmentation is performed by first applying a trained CRF model to a line, where each character in the line is labelled as either O or EOS. The EOS label indicates the position for segmentation.
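As a rough illustration of how per-character labels turn into sentences, the sketch below cuts a line right after each character labelled EOS. The function name split_on_eos is hypothetical, the labels are written by hand rather than predicted by a CRF model, and cutting immediately after the EOS character is an assumption about the convention.

```python
def split_on_eos(line, labels):
    """Cut `line` right after every character labelled 'EOS'."""
    if len(line) != len(labels):
        raise ValueError('expected one label per character')
    sentences, start = [], 0
    for i, label in enumerate(labels):
        if label == 'EOS':
            sentences.append(line[start:i + 1])
            start = i + 1
    # Keep any trailing characters that follow the last EOS label.
    if start < len(line):
        sentences.append(line[start:])
    return sentences

line = 'Hi there. Bye.'
# Hand-written labels for illustration: every period is marked EOS.
labels = ['EOS' if ch == '.' else 'O' for ch in line]
sentences = split_on_eos(line, labels)
print(sentences)
```

No character is dropped, so joining the pieces reconstructs the original line.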
Note that prevent_regexes is applied after segment_regexes, meaning that the segmentation positions captured by segment_regexes can be overridden by prevent_regexes.
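The ordering can be pictured with a small standard-library sketch: segment patterns first propose cut positions, then any cut that falls inside a prevent match is discarded. The helper names, the example patterns, and the choice to cut at the end of each match are all assumptions made for illustration; the library's internals may differ.

```python
import re

def cut_positions(line, segment_patterns, prevent_patterns):
    # Step 1: propose a cut at the end of every segment-pattern match.
    cuts = {m.end() for p in segment_patterns for m in re.finditer(p, line)}
    # Step 2: discard cuts that fall strictly inside a prevent-pattern match.
    for p in prevent_patterns:
        for m in re.finditer(p, line):
            cuts -= set(range(m.start() + 1, m.end()))
    return sorted(c for c in cuts if 0 < c < len(line))

def split_at(line, cuts):
    bounds = [0, *cuts, len(line)]
    return [line[a:b] for a, b in zip(bounds, bounds[1:])]

line = 'See Fig. 2 for details. Then continue.'
segment_patterns = [r'\.(?= )']    # cut after a period followed by a space
prevent_patterns = [r'Fig\. \d']   # never cut inside "Fig. <digit>"

without_prevent = split_at(line, cut_positions(line, segment_patterns, []))
with_prevent = split_at(line, cut_positions(line, segment_patterns, prevent_patterns))
print(without_prevent)
print(with_prevent)
```

Here the cut proposed after "Fig." is removed by the prevent pattern, while the cut after "details." survives.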
Let's suppose we want to segment sentences that end with a tilde (~ or 〜), which is often used in some East Asian countries to convey a sense of friendliness, silliness, whimsy or flirtatiousness.
We can devise a regex that looks something like this: (?<=[다요])~+(?= ), where 다 and 요 are the most common characters that finish the sentences in the polite/formal form.
This regex can be added to segment_regexes to take effect:
from copy import deepcopy
from sentsplit.config import ko_config
from sentsplit.segment import SentSplit
my_config = deepcopy(ko_config)
my_config['segment_regexes'].append({'name': 'tilde_ending', 'regex': r'(?<=[다요])~+(?= )', 'at': 'end'})
sent_splitter = SentSplit('ko', **my_config)
sent_splitter.segment('안녕하세요~ 만나서 정말 반갑습니다~~ 잘 부탁드립니다!')
# results with the regex: ['안녕하세요~', ' 만나서 정말 반갑습니다~~', ' 잘 부탁드립니다!']
# results without the regex: ['안녕하세요~ 만나서 정말 반갑습니다~~ 잘 부탁드립니다!']
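To see what the regex itself contributes, independent of the CRF model, the pattern can be checked with Python's re module. The sketch below cuts the example sentence at the end of each match, mirroring 'at': 'end', and also shows why stripping spaces gives up exact reconstruction. It uses only the standard library and does not call sentsplit.

```python
import re

text = '안녕하세요~ 만나서 정말 반갑습니다~~ 잘 부탁드립니다!'
pattern = re.compile(r'(?<=[다요])~+(?= )')

# Cut at the end index of every match.
ends = [m.end() for m in pattern.finditer(text)]
bounds = [0, *ends, len(text)]
pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
print(pieces)

# The unstripped pieces join back to the original text exactly.
print(''.join(pieces) == text)

# After stripping, the leading spaces are lost and the join no longer matches.
stripped = [p.strip() for p in pieces]
print(''.join(stripped) == text)
```

The pieces keep their leading spaces, which is what makes the exact reconstruction possible.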
To learn more about the regular expressions, this website provides a good tutorial.
Creating a new model involves first training a CRF model on a dataset of clean sentences, followed by (optionally) adding or modifying the feature arguments for better performance.
First, prepare a corpus file where a single line corresponds to a single sentence. Then, a CRF model can be trained by running a command:
sentsplit train -l lang_code -c corpus_file_path # outputs to {corpus_file_path}.{lang_code}-{ngram}-gram-{YearMonthDate}.model
sentsplit train -h # prints out the detailed usage
The following arguments are used to configure training:
- ngram: maximum ngram features used for the CRF model; default is 5.
- crf_max_iteration: maximum number of CRF iterations for training; default is 50.
- sample_min_length: when preparing an input sample for the CRF model, gold sentences are concatenated to form a longer sample with a length greater than sample_min_length; default is 450.
- depunctuation_ratio: ratio of training samples with no punctuation in between the sentences. May only be suitable for certain languages (e.g. "ko", "ja") that have specific endings for sentences. The top-num_depunctuation_endings most common endings are computed from the corpus. 1.0 means 100% of the training samples are depunctuated.
- num_depunctuation_endings: number of most common sentence endings to extract and use.
- ending_length: length of sentence endings counted from the end, excluding any punctuation.
- despace_ratio: ratio of training samples without whitespace in between the sentences. 1.0 means 100% of the training samples are despaced. For languages that do not often use whitespace, set this to a high value close to 1.0.
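To make sample_min_length and despacing more concrete, here is a minimal sketch of how gold sentences might be concatenated into labelled samples. It is an assumption-laden illustration, not the library's training code: build_samples is a hypothetical name, the single-space separator, the placement of EOS on the last character of each sentence, and keeping a short leftover sample are all guesses, and depunctuation is left out.

```python
def build_samples(sentences, sample_min_length, despace=False):
    """Concatenate gold sentences into (text, labels) samples.

    A sample is emitted once its length exceeds sample_min_length.
    """
    sep = '' if despace else ' '
    samples, text, labels = [], '', []
    for sent in sentences:
        if not sent:
            continue
        if text:
            # Separator characters between sentences are labelled O.
            text += sep
            labels += ['O'] * len(sep)
        text += sent
        # Assumption: the last character of each sentence carries EOS.
        labels += ['O'] * (len(sent) - 1) + ['EOS']
        if len(text) > sample_min_length:
            samples.append((text, labels))
            text, labels = '', []
    if text:
        # Assumption: a short leftover sample is kept rather than dropped.
        samples.append((text, labels))
    return samples

gold = ['I came.', 'I saw.', 'I won.']
spaced = build_samples(gold, sample_min_length=10)
despaced = build_samples(gold, sample_min_length=10, despace=True)
print(spaced[0][0])
print(despaced[0][0])
```

A despaced sample gives the model examples of sentence boundaries that are not marked by whitespace, which is why a high despace_ratio suits languages that rarely use spaces.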
Refer to the base_config in config.py. Append a new config to the file, adjusting the arguments accordingly if needed.
A newly created model can also be used directly in code by passing the kwargs accordingly:
from sentsplit.segment import SentSplit
sent_splitter = SentSplit(lang_code, model='path/to/model', ...)
Currently supported languages are:
- English (en)
- French (fr)
- German (de)
- Italian (it)
- Japanese (ja)
- Korean (ko)
- Lithuanian (lt)
- Polish (pl)
- Portuguese (pt)
- Russian (ru)
- Simplified Chinese (zh)
- Turkish (tr)
Please note that the models for many of these languages are trained on openly available sentences gathered from bilingual corpora for machine translation. The training sentences for European languages are mostly from the Europarl corpora, so the default models may not handle colloquial sentences effectively. We can either train a new CRF model with more gold sentences from the target domain, or devise a set of domain-specific regex rules if need be.
sentsplit is licensed under the MIT license, as found in the LICENSE file.