Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Hypher setup and hyphenation #1

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Optimize Hypher setup and hyphenation #1

wants to merge 3 commits into from

Conversation

aptomaKetil
Copy link

The Hyper.prototype.createTrie-function is sailing up as one of the biggest timesinks on regular autopublishes. I've picked some autopublish-configs at random for a couple of customers and ran them through profiling in FF, and we're typically spending around 20-25% of the actual rendering just setting up Hypher. This function is also responsible for ~35% of the total memory allocations during these runs. This applies to typical full-page stories that result in half a dozen layouts, which is our most common type of autopublish.

Screenshot 2024-03-15 at 10 00 35

These changes brings that down to around 6-7% while also cutting allocations in half:

Screenshot 2024-03-15 at 10 00 26

I've added a benchmark for createTrie, which shows about an 80% improvement with these changes. The changes I've made to hyphenate show a 60% improvement on the benchmark, but this will not be as noticeable for us since we aggressively memoize these results anyway.

I will try my best to explain the changes to createTrie, to the extent I understand this function myself. This StackOverflow answer is a good primer on TeX-style hyphenation as a concept. Hypher pre-groups the patterns by length, which is why the pattern files look different from what you'd find on hyphenation.org, but the patterns themselves are the same. Hypher also uses _ to represent word boundaries, while TeX uses ., but this appears to only be a cosmetic change.

What createTrie currently does, is use regex to break each pattern group into sections according to their length. E.g. if the pattern group for length 3 starts with 2aaa1äa1ba1da1ga1j2, this regex would break that into an array of ['2aa', 'a1ä', 'a1b', 'a1d', ...] and so on (line 66 on master). This obviously causes a whole bunch of allocations of tiny strings, which the GC will need to deal with later. It then additionally used regex to break each pattern segment into arrays of chars and points. E.g. the pattern segment a1b would be broken into the arrays ['a', 'b'] and ['', '1', ''], again causing a ton of tiny allocations.

This has been changed to two for-loops, iterating over the original pattern string in a sliding window, where we also inspect each character in the pattern segment directly instead of breaking them up into separate strings and arrays beforehand. This completely eliminates all the intermediate allocations, so we're only allocating for the final tree-structure. We also avoid any use of regex. A side-effect of this is that the point-arrays will no longer get a trailing zero-score, which seems to make no difference to the results or tests, but saves an iteration when hyphenating words, leading to a significant performance increase by complete accident.

The change to hyphenate is very simple. The function currently does two completely unnecessary string-splits, to convert a string into an array of chars, when you can just index the string directly and get the same result.

All 4577(!) test cases in hypher still pass. Alf-tests pass, LP tests and playwright-tests pass. We should test this on dev before going any further though. If this passes review, I think the best course of action is to publish this on @aptoma/hypher, and then deploy that to dev for verification. The source project has been dead for six years, which is why I'm not opening a PR there, but we can consider that later if this ends up working out.

@aptomaKetil aptomaKetil requested a review from peterudo March 15, 2024 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant