Don't include unused fonts in the PDF document #134

dnlmlr · 2023-03-02T16:04:47Z

I have seen some discussion about using allsorts for actual subsetting, which would significantly reduce the PDF size. This PR does not remove individual glyphs, but it at least omits completely unused fonts from the PDF output.

I am using `PdfLayerReference::set_font` to mark fonts as used and then skip unmarked fonts in `FontList::into_with_document`. This is a rather simple hack, but it lets you add all the fonts you want without wasting space on fonts or font variants that are never used. My main use case is the genpdf crate, which requires adding all 4 variants of a font (regular, italic, bold, italic-bold) even if not all of them are used.

Let me know if there is something I missed here which could cause problems
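To illustrate the mark-and-skip idea described above, here is a minimal sketch. It is not the code from this PR: `FontRegistry`, `add_font`, `mark_used` and `into_used_fonts` are hypothetical names that stand in for the roles played by `FontList`, `set_font` and `into_with_document`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a font registry; not printpdf's actual `FontList`.
#[derive(Default)]
pub struct FontRegistry {
    /// Font name -> raw font bytes.
    fonts: BTreeMap<String, Vec<u8>>,
    /// Names of fonts that were selected at least once.
    used: BTreeSet<String>,
}

impl FontRegistry {
    /// Register a font. Registering alone does not mark it as used.
    pub fn add_font(&mut self, name: &str, bytes: Vec<u8>) {
        self.fonts.insert(name.to_string(), bytes);
    }

    /// Called wherever text selects a font (the role `set_font` plays in the PR).
    /// Unknown font names are ignored.
    pub fn mark_used(&mut self, name: &str) {
        if self.fonts.contains_key(name) {
            self.used.insert(name.to_string());
        }
    }

    /// Consume the registry and keep only the fonts that were marked as used.
    pub fn into_used_fonts(self) -> Vec<(String, Vec<u8>)> {
        let FontRegistry { fonts, used } = self;
        fonts
            .into_iter()
            .filter(|(name, _)| used.contains(name))
            .collect()
    }
}
```

With this shape, a caller can register all four variants of a font up front, and only the variants that were actually selected end up in the output.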

dnlmlr · 2023-03-02T21:27:53Z

I looked a bit further into subsetting with allsorts and managed to subset the fonts externally, before adding them to the PDF. This is not optimal, since it does not use the PDF properties for subsetting and instead basically just creates a new font, but as long as it reliably reduces the file size I'd say it is an option.

There currently is a problem with the allsorts subsetting implementation that causes the output font to be missing a few of the data tables. I found that it still works in the PDF files on all devices and programs I tested so far, as long as the font uses the Unicode type for the cmap entries.

I'll look a bit more into this and do some more testing for my own project. Let me know if you would be interested in an implementation of this into the crate. A pretty simple implementation could be to save all chars that get printed into a hashset linked to the current font. Then the font could be trimmed before writing it into the output PDF. That's probably not the most performant way to do this, but it might be fine, especially if it is an optional feature.
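A rough sketch of the hashset idea, assuming fonts can be identified by a plain name. `CharUsage`, `record` and `chars_for` are hypothetical names for illustration and do not exist in printpdf.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical tracker; font names stand in for whatever font handle the crate uses.
#[derive(Default)]
pub struct CharUsage {
    per_font: HashMap<String, HashSet<char>>,
}

impl CharUsage {
    /// Record every character of `text` as used by `font`.
    pub fn record(&mut self, font: &str, text: &str) {
        self.per_font
            .entry(font.to_string())
            .or_default()
            .extend(text.chars());
    }

    /// Characters recorded for `font`, sorted so the output is deterministic.
    /// Returns an empty list for a font that was never used.
    pub fn chars_for(&self, font: &str) -> Vec<char> {
        let mut chars: Vec<char> = self
            .per_font
            .get(font)
            .map(|set| set.iter().copied().collect())
            .unwrap_or_default();
        chars.sort_unstable();
        chars
    }
}
```

The per-font character set would then be the input to whatever trims the font before it is written.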

Edit: I just saw that the PDF text operations mostly use the codepoints directly and not the characters. That of course makes this more difficult. In theory it would still be possible to remap the codepoints before finally writing the output PDF, but that seems a bit more tedious.

dnlmlr · 2023-03-07T20:01:09Z

OK, I did in fact manage to get automatic subsetting working in printpdf using allsorts. I implemented this on the main branch of my fork, since I needed the feature for my own program.

My implementation runs when `ExternalFont::into_with_document` is executed. It works by scanning all layers to collect the glyphs used with the current font. Then the font is subset, a step that can also produce a mapping table from old to new GIDs. That mapping table is then applied to all text in all layers where the font was used. Finally, the subset font is saved to the PDF file.
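The collect and remap steps could look roughly like the sketch below. The types `TextRun` and `Layer` and both functions are hypothetical simplifications, not printpdf's data model, and the subsetting call itself is left out.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical text run: a font name plus the glyph IDs it draws.
pub struct TextRun {
    pub font: String,
    pub glyph_ids: Vec<u16>,
}

/// Hypothetical layer: just a list of text runs.
pub struct Layer {
    pub runs: Vec<TextRun>,
}

/// Scan all layers and collect the glyph IDs used with `font`.
pub fn collect_used_glyphs(layers: &[Layer], font: &str) -> BTreeSet<u16> {
    layers
        .iter()
        .flat_map(|layer| layer.runs.iter())
        .filter(|run| run.font == font)
        .flat_map(|run| run.glyph_ids.iter().copied())
        .collect()
}

/// Rewrite every run that uses `font` with the old -> new GID table.
/// GIDs missing from the table are left unchanged (an arbitrary choice for this sketch).
pub fn remap_glyphs(layers: &mut [Layer], font: &str, old_to_new: &HashMap<u16, u16>) {
    for run in layers.iter_mut().flat_map(|layer| layer.runs.iter_mut()) {
        if run.font != font {
            continue;
        }
        for gid in run.glyph_ids.iter_mut() {
            if let Some(new_gid) = old_to_new.get(gid) {
                *gid = *new_gid;
            }
        }
    }
}
```

Runs that use a different font are untouched, which matters because each font gets its own mapping table.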

Since it was honestly kind of a hassle to work with the current codebase, I merged the currently open PR #131 before implementing this feature. If that gets merged, it would be pretty easy to integrate

fschutt · 2023-03-08T19:45:09Z

@dnlmlr merged, can you rebase / fix? thanks

- Todo: Don't panic!

- Instead of mapping the GIDs via the unicode values, we can just make use of the fact that the new GIDs are issued in the same order as the glyphs_to_keep are provided
- This means that the mapping can be significantly simplified, as the first *old* GID will have the new GID `1`, the next will have the new GID `2` and so on
- This also fixes the issue of glyphs that don't have a unicode value. These couldn't be mapped correctly before. This includes some specific math symbols, for example the glyph `radical.v1`

- This is still not optimal since errors are handled silently and simply cause a fallback to not using subsetting
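The order-based GID mapping described in the list above can be sketched as follows. This assumes that the list handed to the subsetter starts with glyph 0 (`.notdef`) and that output glyphs are numbered by their position in that list; both helper functions are hypothetical and not taken from the PR.

```rust
use std::collections::{BTreeSet, HashMap};

/// Build the ordered list of glyphs to keep.
/// Assumption: glyph 0 (`.notdef`) always comes first, followed by the used
/// glyphs in ascending order.
pub fn glyphs_to_keep(used: &BTreeSet<u16>) -> Vec<u16> {
    let mut keep = vec![0u16];
    keep.extend(used.iter().copied().filter(|&gid| gid != 0));
    keep
}

/// Build the old -> new GID table from the ordered list.
/// Assumption: the subsetter issues new GIDs in the order of this list, so
/// the new GID is simply the index. A font has at most 65536 glyphs, so the
/// index fits into a u16.
pub fn gid_mapping(glyphs_to_keep: &[u16]) -> HashMap<u16, u16> {
    glyphs_to_keep
        .iter()
        .enumerate()
        .map(|(new_gid, &old_gid)| (old_gid, new_gid as u16))
        .collect()
}
```

Because the table is keyed by old GID rather than by unicode value, glyphs without a unicode value are mapped like any other glyph.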

dnlmlr · 2023-03-08T20:11:56Z

I rebased my current allsorts-based subsetting on master and pushed that for this PR. Please check whether this implementation is OK for you, as it is quite a bit more complex than the previous one, which just removed completely unused fonts. The whole subsetting, including the allsorts dependency, is gated behind a feature flag. If the feature is not enabled, there should be no impact on performance or any other metric.
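The commits in this PR also mention a runtime switch (`allow_subsetting`) and a silent fallback when subsetting fails. A minimal sketch of that behaviour is shown here. The function `font_bytes_to_embed`, the `SubsetError` type and the closure parameter are hypothetical, and the compile-time feature gate is not shown.

```rust
/// Hypothetical error type for a failed subsetting attempt.
#[derive(Debug)]
pub struct SubsetError(pub String);

/// Return the bytes to embed: the subset font if subsetting is allowed and
/// succeeds, otherwise the original font unchanged.
pub fn font_bytes_to_embed<F>(original: &[u8], allow_subsetting: bool, subset: F) -> Vec<u8>
where
    F: FnOnce(&[u8]) -> Result<Vec<u8>, SubsetError>,
{
    if !allow_subsetting {
        return original.to_vec();
    }
    match subset(original) {
        Ok(bytes) => bytes,
        // Silent fallback: the error is dropped and the full font is embedded.
        Err(_) => original.to_vec(),
    }
}
```

Dropping the error keeps PDF generation from failing, at the cost of a larger file and no indication of what went wrong. Returning or logging the error would be the obvious next step.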

This also contains another cargo fmt pass and an update to the Rust 2021 edition.

fschutt · 2023-03-13T14:07:31Z

lgtm, although I think I should slowly work towards a proper data model for PdfPage, so that manipulation becomes easier

dnlmlr changed the title ~~Don't include unused fonts from the PDF document~~ Don't include unused fonts in the PDF document Mar 5, 2023

dnlmlr added 9 commits March 8, 2023 21:05

Cargo fmt

56a68e2

Update to rust edition 2021

c33b129

Implement glyph-usage based font subsetting

685c96a

- Todo: Don't panic!

Fully omit unused fonts

863dbe6

Make font subsetting configurable at runtime

45a5122

Hide subsetting functions when feature is unused

f000f31

Fix allow_subsetting

b32cbac

Gracefully handle errors during subsetting

248974c

- This is still not optimal since errors are handled silently and simply cause a fallback to not using subsetting

dnlmlr force-pushed the remove-unused-fonts branch from 500e6ef to 248974c Compare March 8, 2023 20:06

fschutt merged commit e412845 into fschutt:master Mar 13, 2023
