A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features

xamgore/segtok

segtok

Segtok is a fast, rule-based sentence segmentation and tokenization library for well-orthographed texts, particularly in English, German, and Romance languages.

  • Unicode support
  • High precision for well-orthographed texts
  • Minimal false positives
  • Handles complex sentence boundaries
  • Handles technical texts and URLs

Segtok is lightweight, customizable for developers, and integrates easily into Unix-based workflows. It is best suited to structured, regular texts where precision and speed are crucial.

This library is a port of the Python package (which is no longer maintained) and fixes a few bugs that remain unfixed there. You may want to read about why segtok was made.

Example

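use segtok::{segmenter::*, tokenizer::*};

fn main() {
  // Load the sample text at compile time.
  let input = include_str!("../tests/test_google.txt");

  // Split the text into sentences, then tokenize each sentence
  // and split contractions into separate tokens.
  let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
    .into_iter()
    .map(|span| split_contractions(web_tokenizer(&span)).collect())
    .collect();
}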
