Optimize Hypher setup and hyphenation #1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The
Hyper.prototype.createTrie
-function is sailing up as one of the biggest timesinks on regular autopublishes. I've picked some autopublish-configs at random for a couple of customers and ran them through profiling in FF, and we're typically spending around 20-25% of the actual rendering just setting up Hypher. This function is also responsible for ~35% of the total memory allocations during these runs. This applies to typical full-page stories that result in half a dozen layouts, which is our most common type of autopublish.These changes brings that down to around 6-7% while also cutting allocations in half:
I've added a benchmark for
createTrie
, which shows about an 80% improvement with these changes. The changes I've made tohyphenate
show a 60% improvement on the benchmark, but this will not be as noticeable for us since we aggressively memoize these results anyway.I will try my best to explain the changes to
createTrie
, to the extent I understand this function myself. This StackOverflow answer is a good primer on TeX-style hyphenation as a concept. Hypher pre-groups the patterns by length, which is why the pattern files look different from what you'd find on hyphenation.org, but the patterns themselves are the same. Hypher also uses_
to represent word boundaries, while TeX uses.
, but this appears to only be a cosmetic change.What
createTrie
currently does, is use regex to break each pattern group into sections according to their length. E.g. if the pattern group for length 3 starts with2aaa1äa1ba1da1ga1j2
, this regex would break that into an array of['2aa', 'a1ä', 'a1b', 'a1d', ...]
and so on (line 66 on master). This obviously causes a whole bunch of allocations of tiny strings, which the GC will need to deal with later. It then additionally used regex to break each pattern segment into arrays of chars and points. E.g. the pattern segmenta1b
would be broken into the arrays['a', 'b']
and['', '1', '']
, again causing a ton of tiny allocations.This has been changed to two for-loops, iterating over the original pattern string in a sliding window, where we also inspect each character in the pattern segment directly instead of breaking them up into separate strings and arrays beforehand. This completely eliminates all the intermediate allocations, so we're only allocating for the final tree-structure. We also avoid any use of regex. A side-effect of this is that the point-arrays will no longer get a trailing zero-score, which seems to make no difference to the results or tests, but saves an iteration when hyphenating words, leading to a significant performance increase by complete accident.
The change to
hyphenate
is very simple. The function currently does two completely unnecessary string-splits, to convert a string into an array of chars, when you can just index the string directly and get the same result.All 4577(!) test cases in hypher still pass. Alf-tests pass, LP tests and playwright-tests pass. We should test this on dev before going any further though. If this passes review, I think the best course of action is to publish this on
@aptoma/hypher
, and then deploy that to dev for verification. The source project has been dead for six years, which is why I'm not opening a PR there, but we can consider that later if this ends up working out.