Release v0.4.0 (#63)
* Add descriptions of variants of DaachorseError

* Fix the builder to work with fixed-size helpers (#28)

* access extra via member func

* use only FREE_STATES elements

* add #[must_use]

* fix by clippy

* Update src/builder.rs

Co-authored-by: Koichi Akabe <[email protected]>

Co-authored-by: Koichi Akabe <[email protected]>

* Address empty patterns (#29)

* handle empty patterns

* move some test

* fix following clippy

* Add basic parts of charwise daachorse (#31)

* add api

* fix

* Update src/charwise/mapper.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/charwise/mapper.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/charwise.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/charwise.rs

Co-authored-by: Koichi Akabe <[email protected]>

* add lifetime param

* rename

* add no_suffix

* Update src/charwise/iter.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/charwise.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/charwise/iter.rs

Co-authored-by: Koichi Akabe <[email protected]>

* rm new

* Update src/charwise/mapper.rs

Co-authored-by: Koichi Akabe <[email protected]>

Co-authored-by: Koichi Akabe <[email protected]>

* Add trait of original NFA builder (#32)

* generalize sparse_nfa

* add comments (and minor)

* move for github diff

* add comment

* unify

* add iter

* add error handling

* add comment

* Update src/builder.rs

Co-authored-by: Koichi Akabe <[email protected]>

* dyn dispatch -> enum dispatch

* fix

* fix the generalization

* add wrapper for EdgeMapIter

* add default to SparseNfaBuilderState

* rm clone_to_vec

* add EdgeMap for chars

* use BTreeMap for EdgeMap

* minor

* use type alias for EdgeMap

Co-authored-by: Koichi Akabe <[email protected]>

* Use stack to traverse (#34)

* Use RefCell to avoid cloning edges (#33)

* Avoid to store unnecessary pointers in construction (#35)

* use u8vec for labels

* Update src/builder.rs

Co-authored-by: Koichi Akabe <[email protected]>

Co-authored-by: Koichi Akabe <[email protected]>

* Add test for input order (#36)

* add test

* rm clone

* enhance

* Separate NFA builder into another file (#37)

* separate nfa_builder

* rm dependency

* handle error msg (#39)

* Move tests to src/tests (#38)

* Add charwise builder (#40)

* add builder and freq

* add builder

* substract

* add mapper argument

* add comment

* Add Result type alias (#41)

* Add Result type alias

* update

* Remove Default implementation of MatchKind (#42)

* Implement mappers and examples (#43)

* add mappers

* add tests

* add examples

* fix

* add example

* substract

* Update src/charwise/mapper.rs

Co-authored-by: Koichi Akabe <[email protected]>

* implement a naive dat

* fix

* implement a naive dat

* fix

* fix length bug

* rm FreqMapper

* modify examples with multibyte chars

* Update src/nfa_builder.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/nfa_builder.rs

Co-authored-by: Koichi Akabe <[email protected]>

Co-authored-by: Koichi Akabe <[email protected]>

* Move integration tests to tests directory (#45)

* Move integration tests to tests directory

* Add missing file

* Remove unnecessary file

* Simplify test module on random strings (#44)

* add naive find funcs

* fix args

* fix bug for findIter

* Add leftmost-first random test (#46)

* add leftmost-first

* fix

* rm minmax

* simplify

* move (#47)

* Add duplicate pattern tests (#49)

* Remove mapper (#50)

* rm mapper

* fix

* Add leftmost iterators of charwise version and examples (#51)

* add leftmost

* minor

* Add integration tests for charwise daachorse (#52)

* add charwise

* add tests

* add charwise bench (#53)

* make unsafe (#54)

* Refactor DaachorseError (#55)

* Refactor DaachorseError

* fix

* fix

* fmt

* Add `_with_iter()` functions (#56)

* Add U8SliceIterator and use it in each FindIterator

* Add _with_iter() functions

* with -> from

* Refactoring

* Add `_from_iter()` functions for charwise automata (#57)

* Add U8SliceIterator and use it in each FindIterator

* Add _with_iter() functions

* with -> from

* Add CharWithEndOffsetIterator

* Add from_iter functions

* Add inline

* Refactoring

* fix

* clippy

* Add a test for CharWithEndOffsetIterator (#58)

* Enhance documents for charwise version (#59)

* add example

* minor

* add

* add doc

* fix

* fix linkage

* minor

* add Requirements

* add

* minor

* minor

* Update README.md

Co-authored-by: Koichi Akabe <[email protected]>

* Update README.md

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/charwise.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update src/lib.rs

Co-authored-by: Koichi Akabe <[email protected]>

* Update README.md

Co-authored-by: Koichi Akabe <[email protected]>

Co-authored-by: Koichi Akabe <[email protected]>

* Add codes to measure memory usages (#60)

* add memory stats

* fix

* Bump up to 0.4.0 (#61)

* Bump up to 0.4.0

* Fix README

* Update figures (#62)

* Update figures

* Update README.md

Co-authored-by: Shunsuke Kanda <[email protected]>
vbkaisetsu and kampersanda authored Feb 2, 2022
1 parent e05a245 commit e47fec1
Showing 28 changed files with 3,768 additions and 1,222 deletions.
5 changes: 2 additions & 3 deletions Cargo.toml
@@ -1,19 +1,18 @@
[package]
name = "daachorse"
-version = "0.3.0"
+version = "0.4.0"
edition = "2021"
authors = [
"Koichi Akabe <[email protected]>",
"Shunsuke Kanda <[email protected]>",
]
-description = "Daac Horse: Double-Array Aho-Corasick"
+description = "Daachorse: Double-Array Aho-Corasick"
license = "MIT OR Apache-2.0"
homepage = "https://github.com/legalforce-research/daachorse"
repository = "https://github.com/legalforce-research/daachorse"
readme = "README.md"
keywords = ["string", "search", "text", "aho", "multi"]
categories = ["text-processing"]
-autotests = false
exclude = [".*"]

[dependencies]
36 changes: 33 additions & 3 deletions README.md
@@ -18,8 +18,8 @@ but also represents each state in a compact space of only 12 bytes.

For example, compared to the NFA of the [aho-corasick](https://github.com/BurntSushi/aho-corasick) crate
that is the most popular Aho-Corasick implementation in Rust,
-Daachorse can perform pattern matching **3.1 times faster**
-while consuming **45% smaller** memory, when using a word dictionary of 675K patterns.
+Daachorse can perform pattern matching **3.0~5.1 times faster**
+while consuming **45~55% smaller** memory, when using a word dictionary of 675K patterns.
Other experimental results can be found in
[Wiki](https://github.com/legalforce-research/daachorse/wiki).

@@ -33,9 +33,13 @@ To use `daachorse`, depend on it in your Cargo manifest:
# Cargo.toml

[dependencies]
-daachorse = "0.3"
+daachorse = "0.4"
```

### Requirements

To compile this crate, Rust 1.58 or higher is required.
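Besides documenting the MSRV in the README, the requirement can also be declared in the manifest via Cargo's `rust-version` field (available since Cargo 1.56), so older toolchains fail with a clear error. This is a sketch and not part of the actual diff; the surrounding metadata is copied from the Cargo.toml change above:

```toml
[package]
name = "daachorse"
version = "0.4.0"
edition = "2021"
# Cargo refuses to build with an older toolchain than this,
# matching the documented MSRV of Rust 1.58.
rust-version = "1.58"
```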

## Example usage

Daachorse contains some search options,
@@ -169,6 +173,32 @@ assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());
```

### Building faster automaton on multibyte characters

To build a faster automaton on multibyte characters, use `CharwiseDoubleArrayAhoCorasick` instead.

The standard version `DoubleArrayAhoCorasick` handles strings as UTF-8 sequences
and defines transition labels using byte values.
On the other hand, `CharwiseDoubleArrayAhoCorasick` uses Unicode code point values,
resulting in fewer transitions and faster matching.

```rust
use daachorse::charwise::CharwiseDoubleArrayAhoCorasick;

let patterns = vec!["全世界", "世界", "に"];
let pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_iter("全世界中に");

let m = it.next().unwrap();
assert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());
```
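The byte offsets asserted in the example follow directly from UTF-8 encoding: each CJK character in the haystack occupies 3 bytes, so the match ends at byte 9 rather than character 3. A minimal std-only sketch (independent of daachorse) reproducing the offset arithmetic:

```rust
/// Starting byte offset of each character plus the total byte length,
/// mirroring the start/end offsets reported by the matcher above.
fn char_byte_offsets(s: &str) -> Vec<usize> {
    let mut offs: Vec<usize> = s.char_indices().map(|(i, _)| i).collect();
    offs.push(s.len());
    offs
}

fn main() {
    // Five CJK characters, 3 bytes each: "全世界" spans bytes 0..9
    // and the final character spans bytes 12..15, matching the example.
    assert_eq!(char_byte_offsets("全世界中に"), vec![0, 3, 6, 9, 12, 15]);
}
```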

## CLI

This repository contains a command line interface named `daacfind` for searching patterns in text files.
10 changes: 7 additions & 3 deletions bench/Cargo.toml
@@ -4,14 +4,18 @@ version = "0.1.0"
edition = "2021"

[dependencies]

[dev-dependencies]
aho-corasick = "0.7.18" # Unlicense or MIT
criterion = { version = "0.3", features = ["html_reports"] } # Apache-2.0 or MIT
daachorse = { path = ".." } # Apache-2.0 or MIT
fst = "0.4.7" # Unlicense or MIT
yada = "0.5.0" # Apache-2.0 or MIT

[dev-dependencies]
criterion = { version = "0.3", features = ["html_reports"] } # Apache-2.0 or MIT

[[bench]]
name = "benchmark"
harness = false

[[bin]]
name = "memory"
path = "src/memory.rs"
85 changes: 85 additions & 0 deletions bench/benches/benchmark.rs
@@ -137,6 +137,10 @@ fn add_build_benches(group: &mut BenchmarkGroup<WallTime>, patterns: &[String])
b.iter(|| daachorse::DoubleArrayAhoCorasick::new(patterns).unwrap());
});

group.bench_function("daachorse/charwise", |b| {
b.iter(|| daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap());
});

group.bench_function("aho_corasick/nfa", |b| {
b.iter(|| aho_corasick::AhoCorasick::new(patterns));
});
@@ -197,6 +201,21 @@ fn add_find_benches(
});
});

group.bench_function("daachorse/charwise", |b| {
let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
b.iter(|| {
let mut sum = 0;
for haystack in haystacks {
for m in pma.find_iter(haystack) {
sum += m.start() + m.end() + m.value();
}
}
if sum == 0 {
panic!();
}
});
});

group.bench_function("aho_corasick/nfa", |b| {
let pma = aho_corasick::AhoCorasick::new(patterns);
b.iter(|| {
@@ -265,6 +284,36 @@ fn add_find_overlapping_benches(
});
});

group.bench_function("daachorse/charwise", |b| {
let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
b.iter(|| {
let mut sum = 0;
for haystack in haystacks {
for m in pma.find_overlapping_iter(haystack) {
sum += m.start() + m.end() + m.value();
}
}
if sum == 0 {
panic!();
}
});
});

group.bench_function("daachorse/charwise/no_suffix", |b| {
let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
b.iter(|| {
let mut sum = 0;
for haystack in haystacks {
for m in pma.find_overlapping_no_suffix_iter(haystack) {
sum += m.start() + m.end() + m.value();
}
}
if sum == 0 {
panic!();
}
});
});

group.bench_function("aho_corasick/nfa", |b| {
let pma = aho_corasick::AhoCorasick::new(patterns);
b.iter(|| {
@@ -373,6 +422,24 @@ fn add_leftmost_longest_find_benches(
});
});

group.bench_function("daachorse/charwise", |b| {
let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasickBuilder::new()
.match_kind(daachorse::MatchKind::LeftmostLongest)
.build(patterns)
.unwrap();
b.iter(|| {
let mut sum = 0;
for haystack in haystacks {
for m in pma.leftmost_find_iter(haystack) {
sum += m.start() + m.end() + m.value();
}
}
if sum == 0 {
panic!();
}
});
});

group.bench_function("aho_corasick/nfa", |b| {
let pma = aho_corasick::AhoCorasickBuilder::new()
.match_kind(aho_corasick::MatchKind::LeftmostLongest)
@@ -432,6 +499,24 @@ fn add_leftmost_first_find_benches(
});
});

group.bench_function("daachorse/charwise", |b| {
let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasickBuilder::new()
.match_kind(daachorse::MatchKind::LeftmostFirst)
.build(patterns)
.unwrap();
b.iter(|| {
let mut sum = 0;
for haystack in haystacks {
for m in pma.leftmost_find_iter(haystack) {
sum += m.start() + m.end() + m.value();
}
}
if sum == 0 {
panic!();
}
});
});

group.bench_function("aho_corasick/nfa", |b| {
let pma = aho_corasick::AhoCorasickBuilder::new()
.match_kind(aho_corasick::MatchKind::LeftmostFirst)
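The `sum` accumulation ending in `if sum == 0 { panic!() }` in the benchmark bodies above is a guard that makes the loop's result observable, so the optimizer cannot elide the measured work. The standard library's `std::hint::black_box` (stable since Rust 1.66, and also re-exported by criterion) is the more idiomatic tool for the same goal. A std-only sketch with a hypothetical `measured_work` stand-in for the matching loop:

```rust
use std::hint::black_box;

// Stand-in for the benchmarked loop: fold per-haystack statistics
// into one value, just as the benches sum match offsets and values.
fn measured_work(haystacks: &[&str]) -> usize {
    haystacks.iter().map(|h| h.len()).sum()
}

fn main() {
    let haystacks = ["全世界中に", "世界"];
    // black_box tells the optimizer the value is used, so the work
    // cannot be removed even without the `if sum == 0` panic guard.
    let sum = black_box(measured_work(&haystacks));
    assert!(sum > 0);
}
```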
82 changes: 82 additions & 0 deletions bench/src/memory.rs
@@ -0,0 +1,82 @@
use std::convert::TryFrom;
use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;
use std::path::Path;

fn main() {
{
println!("== data/words_100000 ==");
let mut patterns = load_file("data/words_100000");
patterns.sort_unstable();
show_memory_stats(&patterns);
}
{
println!("== data/unidic/unidic ==");
let mut patterns = load_file("data/unidic/unidic");
patterns.sort_unstable();
show_memory_stats(&patterns);
}
}

fn show_memory_stats(patterns: &[String]) {
{
let pma = daachorse::DoubleArrayAhoCorasick::new(patterns).unwrap();
format_memory("daachorse (bytewise)", pma.heap_bytes());
}
{
let pma = daachorse::charwise::CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
format_memory("daachorse (charwise)", pma.heap_bytes());
}
{
let pma = aho_corasick::AhoCorasick::new(patterns);
format_memory("aho_corasick (nfa)", pma.heap_bytes());
}
{
let pma = aho_corasick::AhoCorasickBuilder::new()
.dfa(true)
.build(patterns);
format_memory("aho_corasick (dfa)", pma.heap_bytes());
}
{
let fst = fst::raw::Fst::from_iter_map(
patterns
.iter()
.cloned()
.enumerate()
.map(|(i, pattern)| (pattern, i as u64)),
)
.unwrap();
format_memory("fst", fst.as_bytes().len());
}
{
let data = yada::builder::DoubleArrayBuilder::build(
&patterns
.iter()
.cloned()
.enumerate()
.map(|(i, pattern)| (pattern, u32::try_from(i).unwrap()))
.collect::<Vec<_>>(),
)
.unwrap();
format_memory("yada", data.len());
}
}

fn format_memory(title: &str, bytes: usize) {
println!(
"{}: {} bytes, {:.3} MiB",
title,
bytes,
bytes as f64 / (1024.0 * 1024.0)
);
}

fn load_file<P>(path: P) -> Vec<String>
where
P: AsRef<Path>,
{
let file = File::open(path).unwrap();
let buf = BufReader::new(file);
buf.lines().map(|line| line.unwrap()).collect()
}
26 changes: 16 additions & 10 deletions daacfind/src/main.rs
Expand Up @@ -169,23 +169,29 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
let line = match line {
Ok(line) => line,
Err(err) => {
-                if let Some(filename) = filename {
-                    eprintln!("{}: {:?}", filename, err);
-                } else {
-                    eprintln!("{:?}", err);
-                }
+                filename.map_or_else(
+                    || {
+                        eprintln!("{:?}", err);
+                    },
+                    |filename| {
+                        eprintln!("{}: {:?}", filename, err);
+                    },
+                );
break;
}
};
find_and_output(&pma, &line, filename, line_number, opt.color, &mut stdout)?;
}
}
Err(err) => {
-            if let Some(filename) = filename.to_str() {
-                eprintln!("{}: {:?}", filename, err);
-            } else {
-                eprintln!("{:?}", err);
-            }
+            filename.to_str().map_or_else(
+                || {
+                    eprintln!("{:?}", err);
+                },
+                |filename| {
+                    eprintln!("{}: {:?}", filename, err);
+                },
+            );
}
}
}
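The daacfind change above swaps `if let`/`else` chains for `Option::map_or_else`, which takes a fallback closure for `None` and a mapping closure for `Some`. A minimal sketch of the equivalence, using a hypothetical `report` helper (not part of the actual codebase) that returns the message instead of printing it:

```rust
// Mirrors the refactored branch: fall back to the bare error when
// no filename is available, otherwise prefix it with the filename.
fn report(filename: Option<&str>, err: &str) -> String {
    filename.map_or_else(
        || format!("{:?}", err),
        |f| format!("{}: {:?}", f, err),
    )
}

fn main() {
    assert_eq!(report(Some("input.txt"), "oops"), "input.txt: \"oops\"");
    assert_eq!(report(None, "oops"), "\"oops\"");
}
```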
