Don't include unused fonts in the PDF document #134

Merged: 9 commits, Mar 13, 2023


@dnlmlr (Contributor) commented Mar 2, 2023

So I have seen some talk about using allsorts to do actual subsetting, which would significantly reduce the PDF size. This PR does not actually remove glyphs, but it at least omits completely unused fonts from the PDF output.

I am using the PdfLayerReference::set_font function to mark fonts as used, and then skip unmarked fonts in FontList::into_with_document. This is a rather simple hack, but it allows adding all the fonts you want without wasting space on fonts / font variants that are not used at all. The main use case for me is the genpdf crate, where it is required to add all 4 variants of a font (regular, italic, bold, bold-italic) even if not all of them are used.

Let me know if I missed something here that could cause problems.
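The mark-and-skip idea can be sketched roughly as follows. This is a minimal standalone illustration, not the code from this PR: `RegisteredFont`, `FontRegistry`, `mark_used` and `fonts_to_embed` are hypothetical names standing in for what `PdfLayerReference::set_font` and `FontList::into_with_document` do in printpdf.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a registered font; not printpdf's real type.
#[derive(Debug, Clone)]
pub struct RegisteredFont {
    pub bytes: Vec<u8>,
    pub used: bool,
}

/// Hypothetical font registry keyed by an illustrative font name.
#[derive(Debug, Default)]
pub struct FontRegistry {
    fonts: BTreeMap<String, RegisteredFont>,
}

impl FontRegistry {
    /// Register a font; it starts out unused.
    pub fn add(&mut self, name: &str, bytes: Vec<u8>) {
        self.fonts
            .insert(name.to_string(), RegisteredFont { bytes, used: false });
    }

    /// Called whenever a layer selects a font for drawing text.
    /// Unknown names are ignored in this sketch.
    pub fn mark_used(&mut self, name: &str) {
        if let Some(font) = self.fonts.get_mut(name) {
            font.used = true;
        }
    }

    /// Names of the fonts that would be written to the output, in sorted order.
    /// Fonts that were never marked as used are skipped.
    pub fn fonts_to_embed(&self) -> Vec<&str> {
        self.fonts
            .iter()
            .filter(|(_, font)| font.used)
            .map(|(name, _)| name.as_str())
            .collect()
    }
}
```

With four variants registered and only the regular one selected by a layer, `fonts_to_embed` would return just that one name, so the other three variants never reach the output file.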

@dnlmlr (Contributor, Author) commented Mar 2, 2023

So I actually looked a bit into subsetting with allsorts and managed to subset the fonts externally before adding them to the PDF. This is not optimal, since it does not use the PDF properties for subsetting and instead basically just creates a new font, but as long as it reliably works for reducing the file size I'd say it is an option.

There currently is a problem with the allsorts subsetting implementation that causes the output font to be missing a few of the data tables. I found that it still works in the PDF files on all devices and programs I tested so far, as long as the font uses the Unicode type for the cmap entries.

I'll look a bit more into this and do some more testing for my own project. Let me know if you would be interested in an implementation of this in the crate. A pretty simple implementation could be to save all chars that get printed into a `HashSet` linked to the current font. Then the font could be trimmed before writing it into the output PDF. That's probably not the most performant way to do this, but it might be fine, especially if it is an optional feature.

Edit: I just saw that the PDF text operations mostly use the codepoints directly and not the characters. That of course makes this more difficult. In theory it would still be possible to somehow remap the codepoints before finally writing the output PDF, but that seems a bit more tedious.
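A rough sketch of the "set of printed chars per font" idea, using only the standard library. `CharUsage`, `record` and `chars_for` are hypothetical names for illustration and are not part of printpdf; fonts are identified by a plain string here, which is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical tracker: remembers which chars were printed with which font.
#[derive(Debug, Default)]
pub struct CharUsage {
    per_font: HashMap<String, HashSet<char>>,
}

impl CharUsage {
    /// Record every char of `text` as used by `font`.
    pub fn record(&mut self, font: &str, text: &str) {
        self.per_font
            .entry(font.to_string())
            .or_default()
            .extend(text.chars());
    }

    /// Chars used with `font`, sorted and deduplicated.
    /// Returns an empty list for a font that was never used.
    pub fn chars_for(&self, font: &str) -> Vec<char> {
        let mut chars: Vec<char> = self
            .per_font
            .get(font)
            .map(|set| set.iter().copied().collect())
            .unwrap_or_default();
        chars.sort_unstable();
        chars
    }
}
```

The list returned by `chars_for` is what a subsetting step would take as its "keep" set just before the font is written out.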

@dnlmlr changed the title from "Don't include unused fonts from the PDF document" to "Don't include unused fonts in the PDF document" Mar 5, 2023
@dnlmlr (Contributor, Author) commented Mar 7, 2023

OK, I did in fact manage to get automatic subsetting working in printpdf using allsorts. I implemented this on the main branch of my fork, since I needed the feature for my own program.

My implementation runs when ExternalFont::into_with_document is executed. It works by scanning all layers to collect the glyphs used with the current font. Then the font subsetting is executed, which can also be used to produce a mapping table from old to new GIDs. That mapping table is then applied to all texts in all layers where the font was used. And of course the subset font gets saved to the PDF file.

Since it was honestly kind of a hassle to work with the current codebase, I merged the currently open PR #131 before implementing this feature. If that gets merged, it would be pretty easy to integrate.
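The collect and remap steps described above could look roughly like this. It is a simplified sketch under stated assumptions: `TextRun` and `Layer` are hypothetical types (printpdf's real layer model is different), and the subsetting step itself is left out, so the old-to-new GID table is simply passed in by the caller.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical text run: glyph IDs drawn with one font.
#[derive(Debug, Clone, PartialEq)]
pub struct TextRun {
    pub font: String,
    pub glyph_ids: Vec<u16>,
}

/// Hypothetical layer holding text runs.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Layer {
    pub runs: Vec<TextRun>,
}

/// Scan every layer and collect the glyph IDs used with `font`.
pub fn collect_used_gids(layers: &[Layer], font: &str) -> BTreeSet<u16> {
    layers
        .iter()
        .flat_map(|layer| layer.runs.iter())
        .filter(|run| run.font == font)
        .flat_map(|run| run.glyph_ids.iter().copied())
        .collect()
}

/// Rewrite the glyph IDs of every run that uses `font`.
/// IDs missing from the table are left untouched in this sketch.
pub fn remap_gids(layers: &mut [Layer], font: &str, old_to_new: &HashMap<u16, u16>) {
    for run in layers.iter_mut().flat_map(|layer| layer.runs.iter_mut()) {
        if run.font != font {
            continue;
        }
        for gid in run.glyph_ids.iter_mut() {
            if let Some(new_gid) = old_to_new.get(gid) {
                *gid = *new_gid;
            }
        }
    }
}
```

Runs that use a different font are deliberately not touched, because each font gets its own subset and therefore its own table.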

@fschutt (Owner) commented Mar 8, 2023

@dnlmlr merged, can you rebase / fix? thanks

dnlmlr added 9 commits March 8, 2023 21:05
- Instead of mapping the GIDs via the unicode values, we can just make
  use of the fact that the new GIDs are issued in the same order as the
  glyphs_to_keep are provided
- This means that the mapping can be significantly simplified as the
  first *old* GID will have the new GID `1`, the next will have the new
  GID `2` and so on
- This also fixes the issue of glyphs that don't have a unicode value.
  These couldn't be mapped correctly before. This includes some specific
  math symbols like for example the glyph `radical.v1`
- This is still not optimal since errors are handled silently and simply
  cause a fallback to not using subsetting
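The order-based mapping from this commit message can be illustrated with a small sketch. It assumes, as described above, that new GIDs are handed out in the order of the `glyphs_to_keep` list; it additionally assumes that GID `0` (`.notdef`) always stays at `0` and that the list contains no duplicates. `build_gid_mapping` is a hypothetical helper name, not a function from allsorts or printpdf.

```rust
use std::collections::HashMap;

/// Build an old-GID -> new-GID table from the order of `glyphs_to_keep`.
/// Assumption: new GID 0 is reserved for `.notdef`, so the first kept
/// glyph gets new GID 1, the second gets 2, and so on.
pub fn build_gid_mapping(glyphs_to_keep: &[u16]) -> HashMap<u16, u16> {
    let mut mapping = HashMap::new();
    // `.notdef` keeps its ID.
    mapping.insert(0, 0);
    for (index, &old_gid) in glyphs_to_keep
        .iter()
        .filter(|&&gid| gid != 0)
        .enumerate()
    {
        mapping.insert(old_gid, index as u16 + 1);
    }
    mapping
}
```

Because the table depends only on list position, glyphs without a Unicode value (such as `radical.v1`) are mapped the same way as any other glyph.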
@dnlmlr force-pushed the remove-unused-fonts branch from 500e6ef to 248974c March 8, 2023 20:06
@dnlmlr (Contributor, Author) commented Mar 8, 2023

I rebased my current allsorts-based subsetting on master and pushed that for this PR. Please check if this implementation is OK for you, as it is quite a bit more complex than the previous one that just removed completely unused fonts. The whole subsetting and the inclusion of the allsorts crate is locked behind a feature flag. If the feature is not enabled, there should be no impact on performance or any other metric.

This also contains another cargo fmt pass and an update to the current Rust 2021 edition.

@fschutt fschutt merged commit e412845 into fschutt:master Mar 13, 2023
@fschutt (Owner) commented Mar 13, 2023

lgtm, although I think I should slowly work towards a proper data model for PdfPage, so that manipulation becomes easier
