StringZilla 4.0! #201

ashvardanian · 2024-12-07T12:00:30Z

This PR refactors the entire codebase and splits the single-header implementation into multiple headers. It also brings faster kernels for:

Sorting of string sequences and pointer-sized integers,
Levenshtein edit distances for DNA alignment and UTF-8 fuzzy matching,
Needleman-Wunsch pairwise global alignment for proteins,
AES-based hashing functions,
Multi-pattern search.

And more community contributions:

Detecting CPU capabilities 👏 @GoWind - Feat: #143 Inline ASM for detecting CPU features on ARM #196
Windows cross-compilation 👏 @ashbob999 - Added windows cross compile builds & fixed build issues #169
CMake refactor 👏 @friendlyanon - Fix the CMake code of the project #85
Charset initialization 👏 @alexbarev - Bug: Last elements in basic_charset initialization are discarded #200
Benchmarking sorting algorithms 👏 @ashbob999 - Fix: hybrid bench sort issues #209

Why Split the Files? Matching SimSIMD Design

Sadly, most modern software development tooling stinks. VS Code is just as slow and unresponsive as the older Atom and other web-based technologies, while LSP implementations for C++ are equally slow and completely mess up code highlighting for files over 5,000 lines of code (LOC). So, I've unbundled the single-header solution into multiple headers, similar to SimSIMD.

Also, similar to SimSIMD, CPU feature detection has been reworked to distinguish between the serial, Haswell, Skylake, Ice Lake, NEON, and SVE implementations.
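As a rough illustration of that kind of split, here is a minimal dispatch sketch. Everything in it is hypothetical: the `capability` flags, `count_byte_*` kernels, and `pick_count_byte` are invented for this example and are not StringZilla's actual API. It only shows the general pattern of selecting a backend from a capability bitmask with a serial fallback.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical capability flags, one bit per backend named above.
enum capability : std::uint32_t {
    cap_serial = 1u << 0,
    cap_haswell = 1u << 1,
    cap_skylake = 1u << 2,
    cap_ice = 1u << 3,
    cap_neon = 1u << 4,
    cap_sve = 1u << 5,
};

// Common signature shared by every backend of the same (hypothetical) kernel.
using count_byte_fn = std::size_t (*)(char const *, std::size_t, char);

// Portable fallback: counts occurrences of `needle` one byte at a time.
inline std::size_t count_byte_serial(char const *data, std::size_t length, char needle) {
    std::size_t count = 0;
    for (std::size_t i = 0; i != length; ++i) count += data[i] == needle;
    return count;
}

// Stand-in for a SIMD kernel. A real one would use AVX2 intrinsics;
// here it forwards to the serial code so the sketch stays portable.
inline std::size_t count_byte_haswell(char const *data, std::size_t length, char needle) {
    return count_byte_serial(data, length, needle);
}

// Picks the most capable backend present in the bitmask, else the serial one.
inline count_byte_fn pick_count_byte(std::uint32_t caps) {
    if (caps & cap_haswell) return &count_byte_haswell;
    return &count_byte_serial;
}
```

A caller would query the CPU once, store the resulting bitmask, and reuse the chosen function pointer, for example `pick_count_byte(cap_serial)("banana", 6, 'a')`.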

Faster Sorting

Our old algorithm didn't perform any memory allocations and tried to fit too much into the provided buffers. The new API is a breaking change: it accepts a memory allocator, making the implementation more flexible. It now works fine on 32-bit systems as well.
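To make the idea of an allocator-taking sort concrete, here is a minimal sketch under stated assumptions. The `allocator_t` struct, `default_allocator`, and `argsort_strings` are hypothetical names made up for this example and do not reflect the library's real signatures. The sort is a plain bottom-up merge sort over indices, chosen only because it needs scratch memory and so has a reason to ask the allocator for it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical allocator interface: two callbacks plus an opaque state pointer.
struct allocator_t {
    void *(*allocate)(std::size_t bytes, void *state);
    void (*release)(void *pointer, std::size_t bytes, void *state);
    void *state;
};

inline void *malloc_allocate(std::size_t bytes, void *) { return std::malloc(bytes); }
inline void malloc_release(void *pointer, std::size_t, void *) { std::free(pointer); }
inline allocator_t default_allocator() { return {&malloc_allocate, &malloc_release, nullptr}; }

// Fills `order[0..count)` with a permutation that sorts the NUL-terminated `strings`.
// Scratch space comes from `alloc`; returns false if it can't be obtained.
inline bool argsort_strings(char const *const *strings, std::size_t count, std::size_t *order,
                            allocator_t alloc) {
    for (std::size_t i = 0; i != count; ++i) order[i] = i;
    if (count < 2) return true;

    std::size_t const bytes = count * sizeof(std::size_t);
    auto *scratch = static_cast<std::size_t *>(alloc.allocate(bytes, alloc.state));
    if (!scratch) return false;

    // Bottom-up merge sort: merge sorted runs of width 1, 2, 4, ...
    for (std::size_t width = 1; width < count; width *= 2) {
        for (std::size_t low = 0; low < count; low += 2 * width) {
            std::size_t const mid = std::min(low + width, count);
            std::size_t const high = std::min(low + 2 * width, count);
            std::size_t left = low, right = mid, out = low;
            while (left < mid && right < high) {
                // Take from the right run only when strictly smaller, keeping the sort stable.
                bool const right_first = std::strcmp(strings[order[right]], strings[order[left]]) < 0;
                scratch[out++] = right_first ? order[right++] : order[left++];
            }
            while (left < mid) scratch[out++] = order[left++];
            while (right < high) scratch[out++] = order[right++];
        }
        std::memcpy(order, scratch, bytes);
    }

    alloc.release(scratch, bytes, alloc.state);
    return true;
}
```

Because the input strings are never moved, the same routine can serve as an "argsort"; an in-place variant would apply the resulting permutation to the array afterwards. Passing the allocator explicitly also lets the caller decide what happens on memory exhaustion instead of the sort silently degrading.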

The new serial algorithm is often 5x faster than the std::sort of the C++ Standard Template Library for a vector of strings. It's also often 10x faster than the qsort_r in the GNU C library. There are even faster versions available for Ice Lake CPUs with AVX-512 and Arm CPUs with SVE.
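For readers who want to reproduce the two baselines, they can be set up roughly as follows. This is a sketch, not the project's benchmark code: the helper names are invented here, and it uses the portable `qsort` rather than the GNU-specific `qsort_r`, which differs mainly in passing an extra context argument to the comparator.

```cpp
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Baseline 1: std::sort over a vector of std::string, using operator<.
inline void sort_with_std(std::vector<std::string> &words) {
    std::sort(words.begin(), words.end());
}

// Comparator for qsort: each array element is itself a `char const *`.
inline int compare_c_strings(void const *a, void const *b) {
    return std::strcmp(*static_cast<char const *const *>(a),
                       *static_cast<char const *const *>(b));
}

// Baseline 2: C-style qsort over an array of pointers to NUL-terminated strings.
inline void sort_with_qsort(std::vector<char const *> &words) {
    if (words.empty()) return;
    std::qsort(words.data(), words.size(), sizeof(char const *), &compare_c_strings);
}
```

Both baselines compare bytes as unsigned values, so for strings without embedded NUL characters they should produce the same order. Timing them around a call such as `std::chrono::steady_clock::now()` on the same input gives the reference numbers to compare a faster sort against.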

Faster Sequence Alignment & Scoring

Faster Hashing Algorithms

Multi-Pattern Search

Benchmarks on Sapphire Rapids suggest:

- For 8.3 M words in Leipzig1M.txt of length ~5:
  - `std::sort` is 2 seconds
  - `sz_sort_serial` is 0.6 seconds
  - `qsort_r` is 3.2 seconds
- For 268 M words in XLSum.csv of length ~8:
  - `std::sort` is 147 seconds
  - `sz_sort_serial` is 29 seconds
  - `qsort_r` is 192 seconds

ashvardanian and others added 30 commits November 30, 2024 17:39

Improve: #pragma region dashes

fe4449b

Fix: sz_look_up_transform_avx512 declaration

585f7d5

Merge branch 'main-dev' of https://github.com/ashvardanian/StringZilla …

4fa591b

…into main-dev

Docs: Levenshtein tutorial in Jupyter

715ad10

Improve: Levenshtein functions for unicode

d3b423a

Add: Missing Rust interfaces

1765f33

`sz_checksum`, `sz_hash`, `sz_edit_distance_utf8`, `sz_edit_distance_bounded`, `sz_edit_distance_utf8_bounded`.

Fix: Default Levenshtein upper bound

62ca6a0

Make: Inline ASM for detecting CPU features on ARM

0ee549a

Closes #143

Add: New Levenshtein distance kernels

43471aa

Fix: Wrong env. variable names

d0678f8

Merge branch 'main-dev' of https://github.com/ashvardanian/StringZilla …

7b44e87

…into main-dev

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

ecb3775

…lla/types.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

22e3d1e

…lla/types.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

bd54745

…lla/types.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

8cb0742

…lla/types.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

9e577be

…lla/find.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

14ba3bf

…lla/find.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

b007ba5

…lla/find.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

974ed78

…lla/find.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

9e9f256

…lla/hash.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

08d0a20

…lla/hash.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

a6768af

…lla/hash.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

1f60e6d

…lla/hash.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

d74e5dc

…lla/similarity.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

10d829e

…lla/similarity.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

7fdc58f

…lla/similarity.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

e23c35f

…lla/similarity.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

3f9c248

…lla/small_string.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

89c4681

…lla/small_string.h

Make: Split ./include/stringzilla/stringzilla.h to ./include/stringzi…

5d0d2da

…lla/small_string.h

ashvardanian added 30 commits February 15, 2025 00:10

Break: sz_sort now takes allocators

ec81663

Fix: Tail sum order in checksum_haswell

b20d7cd

Improve: Validate checksums in benchmark

abe8d07

Improve: Wrap std::accumulate for checksums

bce107a

Docs: Signatures and typos

982dd4d

Make: Renamed scripts/bench_token.cpp -> scripts/bench_fingerprint.cpp

a0318eb

Make: Renamed scripts/bench_token.cpp -> temp-git-split-file

07d2239

Make: Merging history of scripts/bench_token.cpp -> scripts/bench_fin…

df5786f

…gerprint.cpp

Make: Renamed temp-git-split-file -> scripts/bench_token.cpp

031bedf

Improve: Separate fingerprinting benchmarks

187e0bd

Fix: Sorting benchmarks for new API

66f2ac9

Merge branch 'main-dev' of https://github.com/ashvardanian/StringZilla …

74f8de7

…into main-dev

Fix: In C++11 constexpr constructor must be empty

13bace2

Fox: C library build

eab3137

Fix: uniform_int_distribution upper bound

17f28a3

Make: Recommend pretty-printing GDB symbols

a818f97

Fix: Underflow in serial sorting

5970fa4

Improve: Drop hybrid sort code

50d8291

Add: String sorting tests for different lengths

c670ccd

Fix: sz_sort_serial passes for same length inputs

0fda5a5

Fix: uniform_int_distribution lower bound

bdee111

Improve: Rename sz_sort to sz_qsort

6191cc6

Makes it easier to differentiate stable `sz_msort`

Improve: Introduce typed _sz_swap macro

dcf6c65

Break: Pointer-sized N-gram Sorting

0c38bff

This huge commit brings many new sorting APIs, as well as a new naming convention to differentiate inplace sorting helpers from "argsort" operations. Also refactors the testing and micro-benchmarking helpers.

Fix: Merge-step bug in stable sort

db61d93

Improve: Expose Insertion-sort helpers

a38867f

Add: Smaller Sorting Networks

cd6859a

It yields no noticeable performance improvements

Break: checkum to bytesum, new hash, and PRNG

71f1f4b

Add: AES-based hash placeholders

cb18c78
