Optimize Hypher setup and hyphenation #1

aptomaKetil · 2024-03-15T10:13:00Z

The Hyper.prototype.createTrie-function is sailing up as one of the biggest timesinks on regular autopublishes. I've picked some autopublish-configs at random for a couple of customers and ran them through profiling in FF, and we're typically spending around 20-25% of the actual rendering just setting up Hypher. This function is also responsible for ~35% of the total memory allocations during these runs. This applies to typical full-page stories that result in half a dozen layouts, which is our most common type of autopublish.

These changes brings that down to around 6-7% while also cutting allocations in half:

I've added a benchmark for createTrie, which shows about an 80% improvement with these changes. The changes I've made to hyphenate show a 60% improvement on the benchmark, but this will not be as noticeable for us since we aggressively memoize these results anyway.

I will try my best to explain the changes to createTrie, to the extent I understand this function myself. This StackOverflow answer is a good primer on TeX-style hyphenation as a concept. Hypher pre-groups the patterns by length, which is why the pattern files look different from what you'd find on hyphenation.org, but the patterns themselves are the same. Hypher also uses _ to represent word boundaries, while TeX uses ., but this appears to only be a cosmetic change.

What createTrie currently does, is use regex to break each pattern group into sections according to their length. E.g. if the pattern group for length 3 starts with 2aaa1äa1ba1da1ga1j2, this regex would break that into an array of ['2aa', 'a1ä', 'a1b', 'a1d', ...] and so on (line 66 on master). This obviously causes a whole bunch of allocations of tiny strings, which the GC will need to deal with later. It then additionally used regex to break each pattern segment into arrays of chars and points. E.g. the pattern segment a1b would be broken into the arrays ['a', 'b'] and ['', '1', ''], again causing a ton of tiny allocations.

This has been changed to two for-loops, iterating over the original pattern string in a sliding window, where we also inspect each character in the pattern segment directly instead of breaking them up into separate strings and arrays beforehand. This completely eliminates all the intermediate allocations, so we're only allocating for the final tree-structure. We also avoid any use of regex. A side-effect of this is that the point-arrays will no longer get a trailing zero-score, which seems to make no difference to the results or tests, but saves an iteration when hyphenating words, leading to a significant performance increase by complete accident.

The change to hyphenate is very simple. The function currently does two completely unnecessary string-splits, to convert a string into an array of chars, when you can just index the string directly and get the same result.

All 4577(!) test cases in hypher still pass. Alf-tests pass, LP tests and playwright-tests pass. We should test this on dev before going any further though. If this passes review, I think the best course of action is to publish this on @aptoma/hypher, and then deploy that to dev for verification. The source project has been dead for six years, which is why I'm not opening a PR there, but we can consider that later if this ends up working out.

aptomaKetil added 3 commits March 15, 2024 08:12

Add benchmark for createTrie

4b75275

Optimize Hypher

84ea86e

Remove unused dependency

802b199

aptomaKetil requested a review from peterudo March 15, 2024 10:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize Hypher setup and hyphenation #1

Optimize Hypher setup and hyphenation #1

aptomaKetil commented Mar 15, 2024

Optimize Hypher setup and hyphenation #1

Are you sure you want to change the base?

Optimize Hypher setup and hyphenation #1

Conversation

aptomaKetil commented Mar 15, 2024